Are you ready for big data? Do you need to crawl and scrape massive amounts of pages? At ProxyCrawl we have the tools and resources for the job. Keep reading to learn how to easily build crawlers that can load millions of pages per day.

The first thing we will need is obviously a ProxyCrawl account.

Once your account is ready and your billing details are added (a requirement for using large-volume Crawlers), head over to the Crawlers section to create your first crawler.

This is the control panel for your crawlers, where you can view, stop, start, delete and create your ProxyCrawl crawlers.

Creating your first crawler

Creating a crawler is very easy. Once you are in the Crawlers section (see above), just click on “Create new TCP crawler” if you want to load websites without JavaScript, or “Create new JS crawler” if you want to crawl JavaScript-enabled websites (like the ones built with React, Angular, Backbone, etc.).

You will see something like the following:

First, give your crawler a name. For this example, let’s call it “amazon”, as we will be crawling Amazon pages.

The next field is the callback URL. This is your server, which we will implement in Node for this example, but you can use any programming language: Ruby, PHP, Go, Python or anything else. For demo purposes, our Node server will live at the following URL: http://mywebsite.com/amazon-crawler

So our settings will look like the following:

Now let’s save the crawler with “Create crawler” and build our Node server.

Building a Node scraping server

Let’s start with the basic code for a Node server. Create a server.js file with the following contents:

const http = require('http');

function handleRequest(request, response) {
  response.end();
}

const server = http.createServer(handleRequest);
server.listen(80, () => console.log('Server running on port 80'));
server.on('error', (err) => console.log('Error on server: ', err));

This is a basic server listening on port 80. We will build our response handling in the handleRequest function. If your server runs on a different port, for example 4321, make sure to update the callback URL in your crawler accordingly, for example: http://mywebsite.com:4321/amazon-crawler
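
For instance, you could make the port configurable instead of hard-coding 80. This is just a small sketch, assuming you want to read it from an environment variable (the PORT name is our own choice, not something the crawler requires):

// Illustrative only: read the port from an environment variable so it is easy to change
const port = process.env.PORT || 80;
server.listen(port, () => console.log(`Server running on port ${port}`));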

Request handling function

ProxyCrawl crawlers will send the HTML responses to your server via POST. So we basically have to check that the request method is POST and then read the content of the body, which will be the page HTML. To keep it simple, this is the code for our request handler:

function handleRequest(request, response) {
  // Ignore anything that is not a POST from the crawler
  if (request.method !== 'POST') {
    return response.end();
  }
  // The crawler sends the request identifier and the original URL as headers
  const requestId = request.headers.rid;
  const requestUrl = request.headers.url;
  // Collect the POST body, which contains the page HTML
  let postData = '';
  request.on('data', (data) => (postData += data));
  request.on('end', () => {
    console.log(requestId, requestUrl, postData);
    return response.end();
  });
  request.on('error', () => console.log('Error happened receiving POST data'));
}

With this function in place, you can already start pushing requests to the crawler you just created, and you should start seeing responses arriving at your server.

Try running the following command in your terminal (make sure to replace YOUR_API_TOKEN with your real API token, which you can find in the API docs):

curl "https://api.proxycrawl.com/?token=YOUR_API_TOKEN&url=https%3A%2F%2Fwww.amazon.com&crawler=amazon&callback=true"

Run that command several times, and you will start seeing the log output on your server.
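
If you prefer to push URLs from code instead of the terminal, here is a minimal Node sketch that calls the same API endpoint as the curl command above. The token and the URL list are placeholders you need to replace with your own:

const https = require('https');

const token = 'YOUR_API_TOKEN'; // replace with your real API token
const urls = ['https://www.amazon.com']; // replace with the pages you want to crawl

urls.forEach((url) => {
  const apiUrl = `https://api.proxycrawl.com/?token=${token}&url=${encodeURIComponent(url)}&crawler=amazon&callback=true`;
  https.get(apiUrl, (res) => {
    console.log('Pushed', url, 'status:', res.statusCode);
  }).on('error', (err) => console.log('Error pushing url:', err));
});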

Please note that this is a basic implementation. For real-world usage you will have to consider other things, like better error handling, logging and proper status codes.
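
As a rough sketch of what that could look like (the processPage function here is hypothetical, standing in for your own scraping or storage logic, and the choice of status codes is ours, not a ProxyCrawl requirement):

function handleRequest(request, response) {
  if (request.method !== 'POST') {
    response.statusCode = 405; // method not allowed
    return response.end();
  }
  let postData = '';
  request.on('data', (data) => (postData += data));
  request.on('end', () => {
    try {
      processPage(request.headers.rid, request.headers.url, postData); // hypothetical processing step
      response.statusCode = 200;
    } catch (err) {
      console.log('Error processing page:', err);
      response.statusCode = 500;
    }
    response.end();
  });
  request.on('error', (err) => console.log('Error receiving POST data:', err));
}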

Scraping Amazon data

Now it’s time to extract the actual data from the HTML. We already have a blog post which explains in detail how to do it with Node, so why don’t you jump over to learn about scraping with Node right here? The interesting part starts in the “Scraping Amazon reviews” section. You can apply the same code to your server and you will have a running ProxyCrawl Crawler and scraper. Easy, right?
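
As a quick preview of how the scraping part plugs into our handler, here is a minimal sketch assuming the cheerio library and a made-up CSS selector; the linked post covers the real Amazon selectors in detail:

const cheerio = require('cheerio'); // npm install cheerio

function scrapeReviews(html) {
  const $ = cheerio.load(html);
  const reviews = [];
  // '.review-text' is a hypothetical selector, not Amazon's real markup
  $('.review-text').each((i, el) => reviews.push($(el).text().trim()));
  return reviews;
}

// Inside the request 'end' handler, instead of logging the raw HTML:
// console.log(requestId, requestUrl, scrapeReviews(postData));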

We hope you enjoyed this article about crawling and scraping. If you would like to learn more, don’t hesitate to contact us.