Have you ever tried to scrape content from any website? Maybe from Yelp?

In today’s post we are going to do exactly that: a step-by-step walkthrough of crawling Yelp reviews, so you can quickly get started with your own crawler.

Getting the Yelp reviews URL

As always, the first thing that we have to do is to get the URL that we want to crawl.

For this tutorial we will be using the following restaurant reviews:

https://www.yelp.com/biz/sushi-yasaka-new-york

The reviews we are after are the first ones that appear when visiting that page today.

You will also need a ProxyCrawl account; if you don’t have one, you can create one for free.

Once you have your account and token ready, we are prepared to start.

We will be doing this tutorial in Node.js, but feel free to use any other language.

Loading Yelp reviews

To make things easy with Node, we will be using the request and cheerio open source libraries, both of which can be installed from npm.

Request will allow us to quickly make HTTP requests in Node, while Cheerio will let us parse the HTML that we get back and scrape the Yelp reviews.

So we can proceed to do the following (make sure to use your account token):

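const request = require('request');

// Ask ProxyCrawl to fetch the Yelp page; the target URL is URL-encoded in the "url" parameter.
request('https://api.proxycrawl.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fsushi-yasaka-new-york', (error, response, body) => {
  // Stop on network errors or any non-200 response.
  if (error || (response && response.statusCode !== 200)) {
    return;
  }
  // body holds the raw HTML of the reviews page.
  console.log(body);
});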

We are calling the ProxyCrawl API so that we can crawl Yelp without getting blocked or running into captchas.
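The request URL above carries the Yelp address in URL-encoded form. As a minimal sketch of how that string could be built for any page, the snippet below uses the standard encodeURIComponent function. The helper name buildProxyCrawlUrl is our own, not part of ProxyCrawl, and only the token and url parameters shown in this tutorial are assumed.

```javascript
// Hypothetical helper (not part of ProxyCrawl): builds the API request URL
// from a token and the page we want to crawl.
function buildProxyCrawlUrl(token, targetUrl) {
  // encodeURIComponent escapes ':' and '/' so the target URL
  // travels as a single query-string value.
  return 'https://api.proxycrawl.com/?token=' + encodeURIComponent(token) +
    '&url=' + encodeURIComponent(targetUrl);
}

const apiUrl = buildProxyCrawlUrl(
  'YOUR_TOKEN',
  'https://www.yelp.com/biz/sushi-yasaka-new-york'
);
console.log(apiUrl);
```

The resulting string can be passed to request in place of the hard-coded URL, which makes it easier to crawl other business pages later.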

Scraping Yelp reviews

Now that we have the page HTML in the response body, we can proceed to scrape the actual content and extract the reviews.

We can quickly do that with Cheerio: first we load the resulting HTML into Cheerio, and then we extract the reviews using CSS3 selectors with the same syntax we are used to from jQuery.

So our code will look something like this:

const request = require('request');
const cheerio = require('cheerio');

request('https://api.proxycrawl.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fsushi-yasaka-new-york', (error, response, body) => {
  // Stop on network errors or any non-200 response.
  if (error || (response && response.statusCode !== 200)) {
    return;
  }

  // Parse the returned HTML so we can query it with jQuery-style selectors.
  const $ = cheerio.load(body);
  // Each matching element is one review; print the HTML of its text paragraph.
  $('.review.review--with-sidebar').each(function() {
    console.log($(this).find('.review-content p').html());
  });
});

There you go! We have the Yelp reviews ready to manipulate and maybe store somewhere, such as MongoDB, but that is outside the scope of this tutorial.
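As a rough idea of what "manipulating" the reviews could look like before storing them, here is a minimal sketch that turns the HTML strings returned by .html() into plain-text records. The helper names reviewHtmlToText and toReviewRecords, the record shape, and the sample input are all made up for illustration. The entity handling covers only a few common cases, so treat it as a starting point rather than a complete HTML-to-text converter.

```javascript
// Hypothetical helper: converts the inner HTML of a review paragraph to plain text.
function reviewHtmlToText(html) {
  return html
    .replace(/<br\s*\/?>/gi, '\n') // line breaks become newlines
    .replace(/<[^>]+>/g, '')       // drop any remaining tags
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')        // decode &amp; last to avoid double-decoding
    .trim();
}

// Hypothetical helper: wraps cleaned reviews in plain objects that are easy to serialize.
function toReviewRecords(htmlSnippets, sourceUrl) {
  return htmlSnippets
    // Skip non-string values (for example null when a selector matched nothing).
    .filter((html) => typeof html === 'string')
    .map((html) => ({ source: sourceUrl, text: reviewHtmlToText(html) }));
}

// Made-up sample input standing in for the strings collected in the .each() loop.
const records = toReviewRecords(
  ['Great sushi!<br><br>Fresh fish &amp; friendly staff.', null],
  'https://www.yelp.com/biz/sushi-yasaka-new-york'
);
console.log(JSON.stringify(records, null, 2));
```

In the scraper itself, you could push each .html() result into an array inside the .each() callback and pass that array to toReviewRecords once the loop finishes.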

Remember that if you aren’t using Node but another programming language like Ruby or PHP, you can easily find HTML parsing libraries to parse the results from the ProxyCrawl API.

We hope you enjoyed this tutorial and we hope to see you soon on ProxyCrawl. Happy crawling!