Crawling websites is not an easy task. Once you start making thousands or millions of requests, your server will begin to suffer and will likely get blocked.

As you probably know, ProxyCrawl can help you avoid this situation. In this article, however, we are not going to talk about that. Instead, we are going to look at how you can easily scrape and crawl any website.

This is a hands-on tutorial, so if you want to follow along, make sure that you have a working ProxyCrawl account. It’s free, so go ahead and create one here.

Extracting the URL

The first thing you will notice when registering with ProxyCrawl is that there is no fancy interface for adding the URLs you want to crawl. We want you to have complete freedom, so we created an API that you can call instead.

So let’s say we want to crawl and scrape the information for the iPhone X on Amazon.com. At the time of writing, this is the product URL: https://www.amazon.com/Apple-iPhone-Fully-Unlocked-5-8/dp/B075QN8NDH/ref=sr_1_6?s=wireless&ie=UTF8&sr=1-6

How can we crawl Amazon securely through ProxyCrawl?

First, we go to the my account page, where we will find two tokens: the regular token and the JavaScript token.

The Amazon website is not generated with JavaScript, meaning that it is not rendered on the client side like some sites built with React or Vue. Therefore, we will use the regular token.

For this tutorial, we will use the following demo token: caA53amvjJ24. If you are following along, make sure to get your own from the my account page.

The Amazon URL contains some special characters, so we have to make sure that we encode it properly. For example, in Ruby we could do the following:

require 'cgi'
CGI.escape("https://www.amazon.com/Apple-iPhone-Fully-Unlocked-5-8/dp/B075QN8NDH/ref=sr_1_6?s=wireless&ie=UTF8&sr=1-6")

This will return the following:

https%3A%2F%2Fwww.amazon.com%2FApple-iPhone-Fully-Unlocked-5-8%2Fdp%2FB075QN8NDH%2Fref%3Dsr_1_6%3Fs%3Dwireless%26ie%3DUTF8%26sr%3D1-6

Great! We have our URL ready to be scraped with ProxyCrawl.
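
To see why this step matters, the sketch below compares how a query string is parsed with and without encoding. It is only an illustration: the variable names are ours, and it reuses the demo token from above. Without encoding, the ie and sr parameters of the Amazon URL would be read as separate parameters of the outer request instead of as part of the target URL.

```ruby
require 'cgi'
require 'uri'

target = 'https://www.amazon.com/Apple-iPhone-Fully-Unlocked-5-8/dp/B075QN8NDH/ref=sr_1_6?s=wireless&ie=UTF8&sr=1-6'

# Without encoding, the target's own "&ie=..." and "&sr=..." leak into
# the outer query string as extra parameters.
raw_query  = "token=caA53amvjJ24&url=#{target}"
raw_params = URI.decode_www_form(raw_query).to_h

# With encoding, the whole target URL survives as a single "url" parameter.
encoded_query  = "token=caA53amvjJ24&url=#{CGI.escape(target)}"
encoded_params = URI.decode_www_form(encoded_query).to_h

puts raw_params.keys.inspect          # ["token", "url", "ie", "sr"]
puts encoded_params.keys.inspect      # ["token", "url"]
puts encoded_params['url'] == target  # true
```

In the unencoded case the url parameter is cut off right after s=wireless, so the API would receive a different address than the one we intended.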

Scraping the content

The next thing that we have to do is to make the actual request.

The ProxyCrawl API will help us with that. We just have to make a request to the following URL: https://api.proxycrawl.com/?token=YOUR_TOKEN&url=THE_URL

So we just have to replace YOUR_TOKEN with our token (which is caA53amvjJ24 for demo purposes) and THE_URL with the URL we just encoded.
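
The substitution described above can be sketched as a small helper. The constant API_ENDPOINT and the method name proxycrawl_request_url are hypothetical names chosen for this example, not part of any ProxyCrawl library. The helper only assembles the request URL and does not send anything.

```ruby
require 'cgi'

# Endpoint taken from the URL pattern shown above.
API_ENDPOINT = 'https://api.proxycrawl.com/'

# Hypothetical helper: fills in the token and the encoded target URL.
def proxycrawl_request_url(token, target_url)
  "#{API_ENDPOINT}?token=#{CGI.escape(token)}&url=#{CGI.escape(target_url)}"
end

product_url = 'https://www.amazon.com/Apple-iPhone-Fully-Unlocked-5-8/dp/B075QN8NDH/ref=sr_1_6?s=wireless&ie=UTF8&sr=1-6'
request_url = proxycrawl_request_url('caA53amvjJ24', product_url)
puts request_url
```

Building the URL in one place makes it harder to forget the encoding step when the target URL changes.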

Let’s do it in Ruby!

require 'net/http'

# The url parameter is the encoded product URL from the previous step.
uri = URI('https://api.proxycrawl.com/?token=caA53amvjJ24&url=https%3A%2F%2Fwww.amazon.com%2FApple-iPhone-Fully-Unlocked-5-8%2Fdp%2FB075QN8NDH%2Fref%3Dsr_1_6%3Fs%3Dwireless%26ie%3DUTF8%26sr%3D1-6')
response = Net::HTTP.get_response(uri)
response['original_status'] # status header for the original site (Amazon)
response['pc_status']       # status header reported by ProxyCrawl
response.body               # HTML of the requested page

Done. We have made our first request to Amazon via ProxyCrawl. Secure, anonymous and without getting blocked!

Now we should have the HTML from Amazon back. It should look something like this:

<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->
<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8">
<script type='text/javascript'>var ue_t0=ue_t0||+new Date();</script><!-- sp:feature:cs-optimization -->
<meta http-equiv='x-dns-prefetch-control' content='on'><link rel='dns-prefetch' href='//images-na.ssl-images-amazon.com'><link rel='dns-prefetch' href='//m.media-amazon.com'><link rel='dns-prefetch' href='//completion.amazon.com'><script type='text/javascript'>
...
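
Before working with the body, it can be worth checking the two status headers read in the request above. The following is a minimal sketch: the helper names crawl_succeeded? and fetch_html are hypothetical, and treating a value of 200 in both headers as the success condition is an assumption of this example.

```ruby
require 'net/http'

# Hypothetical helper: true only when both status headers report 200.
# Treating 200/200 as success is an assumption of this sketch.
def crawl_succeeded?(response)
  response['original_status'].to_i == 200 && response['pc_status'].to_i == 200
end

# Hypothetical helper: returns the page HTML, or nil when the crawl did not succeed.
def fetch_html(request_url)
  response = Net::HTTP.get_response(URI(request_url))
  crawl_succeeded?(response) ? response.body : nil
end

# Plain hashes stand in for real responses so the check can be tried offline.
ok      = { 'original_status' => '200', 'pc_status' => '200' }
blocked = { 'original_status' => '503', 'pc_status' => '200' }
puts crawl_succeeded?(ok)      # true
puts crawl_succeeded?(blocked) # false
puts crawl_succeeded?({})      # false (missing headers count as failure)
```

Returning nil on failure keeps the calling code simple: it can retry or skip the page without trying to parse an error response.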

Scraping website content

Now only one part is missing: extracting the actual content.

This can be done in a million different ways, and it always depends on the language you are programming in. We suggest using one of the many libraries that are available.

Here you have some that can help you do the scraping part with the returned HTML:

Scraping with Ruby

Scraping with Node

Scraping with Python
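
To give a rough idea of what the extraction step involves, here is a minimal sketch that uses only the Ruby standard library to pull the page title out of an HTML string with a regular expression. The HTML snippet is made up for the example and stands in for response.body, and extract_title is a hypothetical helper name. A regular expression is enough for this one simple case, but for anything more involved a proper HTML parsing library is the better tool.

```ruby
# Made-up HTML standing in for the response body returned by the API.
html = '<!doctype html><html><head><title> Example Product Page </title></head>' \
       '<body><h1>Example Phone</h1></body></html>'

# Hypothetical helper: returns the text inside the first <title> tag, or nil.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}mi)
  match && match[1].strip
end

puts extract_title(html) # Example Product Page
```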

We hope you enjoyed this tutorial, and we hope to see you soon on ProxyCrawl. Happy crawling!