Have you ever wanted to scrape javascript websites? And what do we mean by javascript-enabled sites?
Sites built with React, Angular, Vue, Meteor, or any other framework that renders its content dynamically in the browser or uses ajax to load it.

So if you have ever been stuck crawling and scraping ajax or javascript websites, this article will help you.

This is a hands-on article, so if you want to follow along, make sure that you have a ProxyCrawl account. Getting one is straightforward and free, so go ahead and create one here.

Getting the proper javascript URL to crawl

Upon registering in ProxyCrawl, you will see that there is no complex interface where you add the URLs that you want to crawl. Instead, we created a simple and easy-to-use API that you can call at any time.

So let’s say we want to crawl and scrape the information of the following page, which is built entirely in React. This will be the URL that we will use for demo purposes: https://ahfarmer.github.io/emoji-search/

If you try to load that URL from your console or terminal, you will see that you don’t get all the HTML of the page. That is because the content is rendered on the client side by React; with a plain curl command there is no browser to execute the javascript, so that code never runs.
You can do the test with the following command in your terminal:

curl https://ahfarmer.github.io/emoji-search/
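
If you run it, you will most likely get back just the static HTML shell that React mounts into, without any of the emoji content. Typically it contains something like this (the exact markup may differ):

<div id="root"></div>

The emoji list only appears inside that div once the javascript is executed in a browser.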

So how can we scrape javascript easily with ProxyCrawl?

First, we will go to the my account page, where we will find two tokens: the regular token and the javascript token.

As we are dealing with a javascript-rendered website, we will use the javascript token.

For this tutorial, we will use the following demo token: 5aA5rambtJS2. But if you are following along, make sure to get yours from the my account page.

First, we need to make sure that we escape the URL, so that any special characters in it won’t collide with the rest of the API call.
For example, if we are using Ruby, we could do the following:

require 'cgi'
CGI.escape("https://ahfarmer.github.io/emoji-search/")

This will return the following:

https%3A%2F%2Fahfarmer.github.io%2Femoji-search%2F

Great! Our javascript website URL is now ready to be scraped with ProxyCrawl.

Scraping the javascript content

The next thing we have to do is make the actual request to get the javascript-rendered content.

The ProxyCrawl API will do that for us. We just have to do a request to the following URL: https://api.proxycrawl.com/?token=YOUR_TOKEN&url=THE_URL

So you will need to replace YOUR_TOKEN with your token :) (remember, for this tutorial we are using 5aA5rambtJS2), and THE_URL has to be replaced with the URL we just encoded.

Let’s do it in Ruby!

require 'net/http'

uri = URI('https://api.proxycrawl.com/?token=5aA5rambtJS2&url=https%3A%2F%2Fahfarmer.github.io%2Femoji-search%2F')
response = Net::HTTP.get_response(uri)

puts response['original_status'] # status code returned by the original website
puts response['pc_status']       # status code returned by ProxyCrawl
puts response.body               # the javascript-rendered HTML

Done. We made our first request to a javascript website via ProxyCrawl. Secure, anonymous and without getting blocked!

Now we should have the HTML of the website back, including the content generated by React via javascript, which should look something like this:

<html lang="en" class="gr__ahfarmer_github_io">
<head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1"><link rel="shortcut icon" href="./src/favicon.ico"><title>Emoji Search</title><link href="/emoji-search/static/css/main.2e862781.css" rel="stylesheet"><style></style></head>
<body data-gr-c-s-loaded="true"><div id="root"><div><header class="component-header"><img src="//cdn.jsdelivr.net/emojione/assets/png/1f638.png" width="32" height="32" alt="">Emoji Search<img src="//cdn.jsdelivr.net/emojione/assets/png/1f63a.png" width="32" height="32" alt=""></header><div class="component-search-input"><div><input></div></div><div class="component-emoji-results"><div class="component-emoji-result-row copy-to-clipboard" data-clipboard-text="💯">
...
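
If you are following along with your own token and URLs, you can combine the encoding and request steps into one small helper. Here is a minimal sketch (the proxycrawl_get name is just for illustration; replace the token with your own javascript token):

require 'net/http'
require 'cgi'

TOKEN = '5aA5rambtJS2' # your javascript token from the my account page

# Fetch a javascript-rendered page through the ProxyCrawl API.
def proxycrawl_get(url)
  uri = URI("https://api.proxycrawl.com/?token=#{TOKEN}&url=#{CGI.escape(url)}")
  Net::HTTP.get_response(uri)
end

response = proxycrawl_get('https://ahfarmer.github.io/emoji-search/')
puts response['pc_status'] # status code returned by ProxyCrawl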

Scraping javascript website content

There is now only one part missing: extracting the actual content from the HTML.

This can be done in many different ways, depending on the language you are using to build your application. We always suggest using one of the many libraries that are out there.

Here are some open source libraries that can help you do the scraping part with the returned HTML:

Javascript scraping with Ruby

Javascript scraping with Node

Javascript scraping with Python
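
As a quick illustration of that extraction step, here is a minimal sketch using Nokogiri, a popular Ruby HTML parsing library (assuming it is installed via gem install nokogiri). The CSS class name and the data-clipboard-text attribute come from the rendered HTML snippet above:

require 'nokogiri'

# Parse the HTML that ProxyCrawl returned (response.body from the earlier example).
doc = Nokogiri::HTML(response.body)

# Each emoji row carries the emoji character in its data-clipboard-text attribute.
doc.css('.component-emoji-result-row').each do |row|
  puts row['data-clipboard-text']
end

This prints every emoji found on the page, one per line.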

We hope you enjoyed this tutorial and we hope to see you soon in ProxyCrawl. Happy crawling!