If you need to collect reviews for many different products, Node is a good fit: its asynchronous I/O makes it easy to fetch data from Amazon at scale.

In this article, we will scrape Amazon reviews and comments together using just a couple of Node.js libraries.

The first thing we need is a list of Amazon URLs; for this example we will use amazon.com URLs. We have collected a sample of around 1,000 product ASINs, which you can download from here.

Amazon products list download
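If your list contains bare ASINs rather than full URLs, you can build review-page URLs from them. A minimal sketch, assuming the standard `/product-reviews/` path that Amazon currently uses (verify it against your own links; the ASIN below is just an example):

```javascript
// Build an Amazon review-page URL from a bare ASIN.
// Assumption: the standard /product-reviews/ path; adjust if your
// list already contains full product URLs.
function reviewUrlFromAsin(asin) {
  return `https://www.amazon.com/product-reviews/${asin}`;
}

console.log(reviewUrlFromAsin('B07FZ8S74R'));
// → https://www.amazon.com/product-reviews/B07FZ8S74R
```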

Loading Amazon URLs in Node

Let’s create a file start.js which will contain our node code.

Let’s also install our two requirements:

  • npm i cheerio
  • npm i proxycrawl

Now we should have our project structure with at least the following files: start.js and amazon-products.txt.

Now it’s time to start coding. Let’s write our code in the start.js file, and we will start by loading the amazon-products.txt file into an array. We can do that with the following piece of code:

const fs = require('fs');
const file = fs.readFileSync('amazon-products.txt');
const urls = file.toString().split('\n');

console.log(urls);
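One caveat: if amazon-products.txt ends with a trailing newline, split('\n') leaves an empty string at the end of the array, which would later be sent to the API as an empty URL. A small sketch of cleaning the list (the inline sample string stands in for the file contents):

```javascript
// Trim whitespace and drop blank entries so no empty "URL" reaches the API.
// The sample string below stands in for the contents of amazon-products.txt.
const sample = 'https://www.amazon.com/dp/AAA\nhttps://www.amazon.com/dp/BBB\n';
const cleaned = sample
  .split('\n')
  .map(line => line.trim())
  .filter(line => line.length > 0);

console.log(cleaned.length); // → 2
```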

Now that we have the URLs in an array, we can start crawling them. We will use the ProxyCrawl node library that we installed before.

Crawling Amazon with ProxyCrawl

We need to initialize the library and create a worker with our token. For Amazon, we should use the normal token; make sure to replace the placeholder below with the actual token from your account.

We have to add the following two lines to our project:

const { ProxyCrawlAPI } = require('proxycrawl');
const api = new ProxyCrawlAPI({ token: 'YOUR_TOKEN' });

With the resulting code being the following:

const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');

const file = fs.readFileSync('amazon-products.txt');
const urls = file.toString().split('\n');
const api = new ProxyCrawlAPI({ token: 'YOUR_TOKEN' });

Time now to crawl the URLs. We will do 10 requests per second, which should suffice for our test; if you need more, make sure to contact ProxyCrawl.

Let’s build our code to send 10 API requests each second…

const requestsPerSecond = 10;
let currentIndex = 0;
const interval = setInterval(() => {
  for (let i = 0; i < requestsPerSecond; i++) {
    // Stop once every URL has been requested
    if (currentIndex >= urls.length) {
      clearInterval(interval);
      return;
    }
    api.get(urls[currentIndex]);
    currentIndex++;
  }
}, 1000);

We are now loading the URLs, but we are not doing anything with the result. So it’s now time to start scraping 😄

Scraping Amazon reviews

We will use the Cheerio library that we installed before to parse the resulting HTML and extract only the reviews.

Let’s first include cheerio:

const cheerio = require('cheerio');

And now let’s build a function which should receive the HTML and parse it accordingly.

function parseHtml(html) {
  // Load the HTML into cheerio
  const $ = cheerio.load(html);
  // Select the review elements
  const reviews = $('.review');
  reviews.each((i, review) => {
    // Extract the review text
    const textReview = $(review).find('.review-text').text();
    console.log(textReview);
  });
}
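Besides the text, you might also want the star rating. Amazon usually renders it as a string like "4.0 out of 5 stars" (inside an element such as `.review-rating`; that selector and format are assumptions based on the current page markup, so check the live HTML). A hedged helper to turn such a string into a number:

```javascript
// Parse a rating string like "4.0 out of 5 stars" into a number.
// Assumption: the "X out of 5 stars" format Amazon currently renders;
// returns null if the text does not match that pattern.
function parseRating(text) {
  const match = text.match(/([\d.]+) out of 5/);
  return match ? parseFloat(match[1]) : null;
}

console.log(parseRating('4.0 out of 5 stars')); // → 4
console.log(parseRating('no rating here'));     // → null
```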

Now that we have the text content of the reviews, we are close to finishing. The missing piece, and the most crucial one, is to connect our parsing function to the earlier code where we call the ProxyCrawl API.

The full code should look like the following:

const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');
const cheerio = require('cheerio');

const file = fs.readFileSync('amazon-products.txt');
const urls = file.toString().split('\n');
const api = new ProxyCrawlAPI({ token: 'YOUR_TOKEN' });

function parseHtml(html) {
  // Load the HTML into cheerio
  const $ = cheerio.load(html);
  // Select the review elements
  const reviews = $('.review');
  reviews.each((i, review) => {
    // Extract the review text
    const textReview = $(review).find('.review-text').text();
    console.log(textReview);
  });
}

const requestsPerSecond = 10;
let currentIndex = 0;
const interval = setInterval(() => {
  for (let i = 0; i < requestsPerSecond; i++) {
    // Stop once every URL has been requested
    if (currentIndex >= urls.length) {
      clearInterval(interval);
      return;
    }
    api.get(urls[currentIndex]).then(response => {
      // Make sure both the API call and the original request succeeded
      if (response.statusCode === 200 && response.originalStatus === 200) {
        parseHtml(response.body);
      } else {
        console.log('Failed: ', response.statusCode, response.originalStatus);
      }
    });
    currentIndex++;
  }
}, 1000);

The code is ready, and you can scrape 10 Amazon reviews per second. Obviously, for this post we are just logging the results to the console; you should replace that console.log with whatever suits your use case: store the reviews in a database, write them to a file, and so on. That is up to you.

Happy crawling!