Web crawling, also known as web spidering or web scraping, is often defined by software developers as “writing software to iterate over a set of web pages to extract content”. It is a great tool for extracting data from the web for many purposes.

Using a web crawler, you can scrape data from a set of articles, mine a large blog, collect quantitative data from Amazon for price monitoring and machine learning, get content from sites that offer no official API, or simply build a prototype for your next project.

In this tutorial, we will teach you the basics of crawling and scraping using ProxyCrawl and Scrapy. As an example, we will use Amazon search result pages to extract product ASIN URLs and titles. By the end of this tutorial, you’ll have a fully functional web scraper that runs through a series of pages on Amazon, extracts data from each page, and prints it to your screen.

The example scraper can easily be extended and used as a solid base for your own projects that crawl and scrape data from the web.

Prerequisites

To complete this tutorial successfully, you’ll need a free ProxyCrawl API token for scraping web pages anonymously, and Python 3 installed on your local machine for development.

Step 1 — Creating the Basic Amazon Scraper

Scrapy is a Python scraping library that includes most of the common tools we need when scraping. It speeds up the scraping process and is maintained by an open source community that loves scraping and crawling the web.

ProxyCrawl has a Python scraping library; combined with Scrapy, it helps ensure that our crawler runs anonymously at scale without being blocked by sites. The ProxyCrawl API is a powerful thin middleware that sits on top of any site.

ProxyCrawl & Scrapy both have packages on PyPI, the Python Package Index, a repository maintained by the Python community for the various libraries that developers need; they can be installed with pip, Python's package installer.

Install ProxyCrawl and Scrapy with the following commands:

pip install proxycrawl
pip install scrapy

Create a new folder for the scraper:

mkdir amazon-scraper

Navigate to the scraper directory that you created above:

cd amazon-scraper

Create a Python file for your scraper. All the code in this tutorial will go in this file. We use the touch command in the console; you can use any editor that you prefer instead.

touch myspider.py

Let us create our first basic Scrapy spider, AmazonSpider, which inherits from scrapy.Spider. As per the Scrapy documentation, subclasses have two required attributes: name, the name of our spider, and start_urls, a list of URLs (we will use one URL for this example). We also import the ProxyCrawl API so that we can build URLs that go through the ProxyCrawl API instead of directly to the original sites, avoiding blocks and captcha pages.
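To see what buildURL does conceptually, here is a rough standard-library sketch of the kind of wrapped URL it produces (the endpoint format mirrors what appears in the Scrapy log output later in this tutorial; the helper name and exact parameters here are illustrative only, and the real proxycrawl library handles all of this for you):

```python
from urllib.parse import quote

PROXYCRAWL_ENDPOINT = 'https://api.proxycrawl.com/'

def build_proxycrawl_url(token, target_url):
    # Percent-encode the target URL and attach it to the API endpoint,
    # so the request goes to ProxyCrawl instead of the original site.
    return '%s?token=%s&url=%s' % (PROXYCRAWL_ENDPOINT, token, quote(target_url, safe=''))

wrapped = build_proxycrawl_url(
    'YOUR_TOKEN',
    'https://www.amazon.com/s/?field-keywords=cold-brew+coffee+makers')
print(wrapped)
```

The original site's URL ends up fully percent-encoded inside the `url` query parameter, which is exactly the shape you will see in the crawl logs.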

Paste the following code in myspider.py:

import scrapy

from proxycrawl.proxycrawl_api import ProxyCrawlAPI

# Get the API token from ProxyCrawl and replace YOUR_TOKEN with it
api = ProxyCrawlAPI({'token': 'YOUR_TOKEN'})

class AmazonSpider(scrapy.Spider):
    name = 'amazonspider'
    # Amazon search result URLs to extract ASIN pages and titles
    urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=cold-brew+coffee+makers&page=1']
    # make the URLs go through the ProxyCrawl API
    start_urls = list(map(lambda url: api.buildURL(url, {}), urls))

Run the scraper. It does not extract data yet, but you should see that it points at the ProxyCrawl API endpoint and that Scrapy reports Crawled (200).

scrapy runspider myspider.py

The result should look something like this. Notice that the request to the Amazon result page through ProxyCrawl passed with a 200 response code.

2018-07-09 02:05:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-09 02:05:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 2.7.15 (default, May 1 2018, 16:44:37) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-16.7.0-x86_64-i386-64bit
2018-07-09 02:05:23 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-07-09 02:05:23 [scrapy.middleware] INFO: Enabled extensions:
...
2018-07-09 02:05:23 [scrapy.core.engine] INFO: Spider opened
2018-07-09 02:05:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-09 02:05:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-09 02:05:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.proxycrawl.com/?token=TOKEN_HIDDEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%2Fref%3Dnb_sb_noss%3Furl%3Dsearch-alias%253Daps%26field-keywords%3Dcold-brew%2Bcoffee%2Bmakers%26page%3D1&> (referer: None)
NotImplementedError: AmazonSpider.parse callback is not defined
2018-07-09 02:05:25 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-09 02:05:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 390,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 146800,
'downloader/response_count': 1,
...
'start_time': datetime.datetime(2018, 7, 8, 23, 5, 23, 681766)}
2018-07-09 02:05:25 [scrapy.core.engine] INFO: Spider closed (finished)

Since we did not write a parser yet, the code simply loaded start_urls, which is just one Amazon search results URL routed through the ProxyCrawl API, and handed the response to the default parse callback, which is not implemented (hence the NotImplementedError in the log). It is now time for our next step: writing a simple parser to extract the data we need.

Step 2 — Scraping Amazon ASIN URLs and Titles

Let us enhance the myspider class with a simple parser that extracts the URLs and titles of all ASIN products on the search result page. For that, we need to know which CSS selectors to use so that Scrapy fetches the data from the right tags. At the time of writing this tutorial, the ASIN URLs are found under the .a-link-normal CSS selector.

Enhancing our spider class with our simple parser will give us the following code:

import scrapy

from proxycrawl.proxycrawl_api import ProxyCrawlAPI

# Get the API token from ProxyCrawl and replace YOUR_TOKEN with it
api = ProxyCrawlAPI({'token': 'YOUR_TOKEN'})

class AmazonSpider(scrapy.Spider):
    name = 'amazonspider'
    # Amazon search result URLs
    urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=cold-brew+coffee+makers&page=1']
    # make the URLs go through the ProxyCrawl API
    start_urls = list(map(lambda url: api.buildURL(url, {}), urls))

    def parse(self, response):
        for link in response.css('li.s-result-item'):
            yield {
                'url': link.css('a.a-link-normal.a-text-normal ::attr(href)').extract_first(),
                'title': link.css('a.a-link-normal ::text').extract_first()
            }

Running Scrapy again should print some nice URLs of ASIN pages along with their titles. 😄

...
2018-07-09 04:01:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.proxycrawl.com/?token=TOKEN_HIDDEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%2Fref%3Dnb_sb_noss%3Furl%3Dsearch-alias%253Daps%26field-keywords%3Dcold-brew%2Bcoffee%2Bmakers%26page%3D1&>
{'url': 'https://www.amazon.com/Airtight-Coffee-Maker-Infuser-Spout/dp/B01CTIYU60/ref=sr_1_5/135-1769709-1970912?ie=UTF8&qid=1531098077&sr=8-5&keywords=cold-brew+coffee+makers', 'title': 'Airtight Cold Brew Iced Coffee Maker and Tea Infuser with Spout - 1.0L/34oz Ovalware RJ3 Brewing Glass Carafe with Removable Stainless Steel Filter'}
2018-07-09 04:01:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://api.proxycrawl.com/?token=TOKEN_HIDDEN&url=https%3A%2F%2Fwww.amazon.com%2Fs%2Fref%3Dnb_sb_noss%3Furl%3Dsearch-alias%253Daps%26field-keywords%3Dcold-brew%2Bcoffee%2Bmakers%26page%3D1&>
{'url': 'https://www.amazon.com/KitchenAid-KCM4212SX-Coffee-Brushed-Stainless/dp/B06XNVZDC7/ref=sr_1_6/135-1769709-1970912?ie=UTF8&qid=1531098077&sr=8-6&keywords=cold-brew+coffee+makers', 'title': 'KitchenAid KCM4212SX Cold Brew Coffee Maker, Brushed Stainless Steel'}
...

Results & Summary

In this tutorial, we learned the fundamentals of web crawling and scraping, and we used the ProxyCrawl API combined with Scrapy to keep our scraper undetected by sites that might block our requests. We also learned the basics of extracting Amazon product pages from Amazon search result pages. You can adapt the example to any site you need and start crawling thousands of pages anonymously with Scrapy & ProxyCrawl.
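For instance, one simple way to extend the spider to many result pages is to generate the start URLs for a range of page numbers before wrapping them with the API. This is a small illustrative sketch only; it relies on the page query parameter in the example URL used throughout this tutorial:

```python
# Base Amazon search URL from the tutorial, ending in page=1.
BASE_URL = ('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps'
            '&field-keywords=cold-brew+coffee+makers&page=1')

# Generate URLs for the first five result pages by swapping the page number.
urls = [BASE_URL.replace('page=1', 'page=%d' % page) for page in range(1, 6)]

for url in urls:
    print(url)
# Each of these would then be passed through api.buildURL(url, {})
# in start_urls, exactly as in the spider above.
```

This list can replace the single-element urls list in the spider, and the rest of the code stays the same.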

We hope you enjoyed this tutorial, and we hope to see you soon at ProxyCrawl. Happy crawling and scraping!