Infinite scrolling works by fetching and rendering new data every time the user scrolls to the bottom of a page. If you are looking for an easy way to crawl a webpage with continuous or lengthy content that requires scrolling, such as Facebook groups, Twitter tweets, or even Quora search results, then this guide can save you precious time and effort.

In this article, we will show you how to create a simple web crawler that automatically scrolls a webpage using our Crawling API with the scroll parameter. We will write our code in Node.js and keep things as beginner-friendly as possible. Moreover, we will share 8 advanced web crawling tactics for web data retrieval that really work.

Before we start coding, it is important to know the 3 key elements for this to work:

  • JavaScript token: This is the token provided to you upon signing up at Crawlbase, and it is needed to pass the parameters below.
  • &scroll parameter: Passing this to the API will allow your request to scroll the page with a default interval of 10 seconds.
  • &scroll_interval: This parameter allows the API to scroll for X seconds after loading the page. The maximum scroll interval is 60 seconds; after 60 seconds of scrolling, the API captures the data and returns it to you.

Scrolling a website with Node

To begin, open your command prompt (Windows) or terminal and check whether Node.js is installed on your system by typing node --version. If you do not have Node yet, or if your version is outdated, we recommend downloading and installing the latest Node.js version first.

Once you have successfully installed or updated Node, go ahead and create a project folder as shown below:

Create node project

In this instance, we will be using Visual Studio Code as an example, but you may also use your favorite code editor.

Create a new file and name it quoraScraper.js:

VSCode Node creation

Now we can start writing our code. First, let's declare the constant variables we need to call the Crawling API with the proper parameters, as shown below:

const https = require('https');
const url = encodeURIComponent('https://www.quora.com/search?q=crawlbase');
const options = {
  hostname: 'api.crawlbase.com',
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&url=' + url,
};

Remember to replace JS_TOKEN with your actual JavaScript token, and note that you can swap the URL with any URL you wish to scrape, as long as you pass the corresponding &scraper parameter.

The next part of our code sends the request, parses the JSON response, and displays the results in the console:

https
  .request(options, (response) => {
    let body = '';
    response
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        const json = JSON.parse(body);
        console.log(json.original_status);
        console.log(json.body);
      });
  })
  .end();

Once done, press F5 (Windows) to see the result, or execute the script from the terminal or command prompt:

C:\Nodejs\project> node quoraScraper.js

Since we have not set the scroll interval yet, it has defaulted to 10 seconds of scrolling, which naturally returns less data.

Fetching more data with Node

Now, if you wish to scroll longer (e.g., 20 seconds), you have to set a value for the &scroll_interval parameter. The full code is shown below:

const https = require('https');
const url = encodeURIComponent('https://www.quora.com/search?q=crawlbase');
const options = {
  hostname: 'api.crawlbase.com',
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&scroll_interval=20&url=' + url,
};

https
  .request(options, (response) => {
    let body = '';
    response
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        const json = JSON.parse(body);
        console.log(json.original_status);
        console.log(json.body);
      });
  })
  .end();

Please make sure to keep your connection open for up to 90 seconds if you intend to scroll for 60 seconds. You can find more information about the scroll parameter in our documentation.
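If your HTTP client times out too early, a long scroll can be cut short. Here is a minimal sketch, reusing the https module and the options object from the example above, of one way to keep the socket alive for the full window; the 90000 value is simply that 90-second window in milliseconds, so adjust it to your own needs:

const request = https.request({ ...options, timeout: 90000 }, (response) => {
  let body = '';
  response
    .on('data', (chunk) => (body += chunk))
    .on('end', () => console.log(JSON.parse(body).body));
});
request.on('timeout', () => request.destroy()); // abort if no response arrives in time
request.end();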

If you run the code again, you should get more data, as shown in the example below:

JSON output

At this point, we have completed a simple scraper that can scroll through a webpage in less than 20 lines of code. Remember that this can be integrated into an existing web scraper, and you are also free to use our Crawlbase Node.js library as an alternative.
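For reference, here is a rough sketch of what the same request might look like with the Crawlbase Node.js library instead of raw https calls. The class, method, and response field names below follow the library's README at the time of writing, so double-check them against the current documentation:

const { CrawlingAPI } = require('crawlbase');

const api = new CrawlingAPI({ token: 'JS_TOKEN' }); // your JavaScript token

api
  .get('https://www.quora.com/search?q=crawlbase', {
    scraper: 'quora-serp',
    scroll: true,
    scroll_interval: 20,
  })
  .then((response) => {
    console.log(response.statusCode); // verify the exact field names in the library docs
    console.log(response.body);
  })
  .catch(console.error);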

Of course, this is just the start; there are lots of things you can do with this, and we hope it has added value to your web scraping knowledge.

What is Web Crawling?

Web crawling is all about grabbing valuable information from websites without manually clicking and copying. To get a bit technical, web crawling involves using special tools or programs to automatically pull data from web pages. It’s like a robot that visits a web page, downloads everything on it, and then picks out the specific information you’re interested in.

What Can Web Crawling Do For You?

Now, you might be curious about why web crawling is such a valuable tool. Well, here's the deal: web scraping is a great time-saver. It automates the retrieval of all sorts of public information from the vast web. It's like a shortcut that beats manual copying any day.

But that’s not all! Web scraping is a handy trick for various tasks, such as:

Competitor Analysis: With web scraping, you can peek into your competitors' websites and keep tabs on their services, prices, and marketing tactics. It's like a backstage pass to their business strategies.
Market Research: Suppose you want to know everything about a specific market, industry, or niche. Web scraping can provide all the valuable data for you. It’s highly useful, especially in fields like real estate.
Machine Learning: Now, here's where web scraping gets even cooler. The data you scrape can become the foundation for your machine learning and AI projects. It's like feeding your algorithms the information they need to learn and grow.

Ready to dive into the world of web crawling? We’ll share some top-notch web scraping best practices to ensure you’re on the right track. Let’s get started!

8 Best Advanced Web Crawling Tactics For You

Now, it's time to discover the eight most significant best practices for web data retrieval. Whether you're dealing with a scrolling website, an infinite scroll website, or setting up a live crawler, these data scraping tips and crawling tactics will come in handy in your web scraping activities.

So, let’s get ready to learn best practices for efficient and effective web data retrieval!

1. Be Patient: Don’t Overload

It’s vital to play nice with the servers you’re interacting with. You see, bombarding a server with too many requests in a short span can lead to trouble. The website you’re targeting might not be ready to handle such a heavy load, and that’s where problems can arise.

To avoid this, it’s essential to introduce a pause time between each request your web crawler makes. This breathing space allows your crawler to navigate web pages gracefully without causing any disruptions to other users. After all, nobody wants a slow website, right?
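As a rough illustration of the idea, here is a minimal Node.js sketch of polite throttling; the delay values are only examples, and the fetching step is left as a placeholder:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlPolitely(urls) {
  for (const url of urls) {
    // ...fetch and process `url` here, e.g. with https.request as shown earlier
    console.log('Crawled:', url);
    await sleep(3000 + Math.random() * 2000); // pause 3-5 seconds between requests
  }
}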

Moreover, firing off a barrage of requests might trigger anti-scraping defenses. These sneaky systems can detect excessive activity and might deny access to your web scraper.

As an extra tip, consider running your crawler during off-peak hours. For instance, web traffic on the target website tends to dwindle at night. It’s one of the golden rules of web scraping best practices, ensuring a smoother experience for all.

2. The Power of Public APIs

Here’s a trick for a smooth web data retrieval process: Go for public APIs. If you’re unfamiliar with the term, API stands for Application Programming Interface. It is like a connection that allows different applications to talk to each other and share data.

Now, many websites rely on these APIs to fetch the data they need for their web pages.

So, how does this help you in your web scraping activities? Well, if the website you're eyeing operates this way, you're in luck. You can spot these API calls right in your browser's developer tools, under the XHR filter in the Network tab.

By intercepting these HTTP requests, you gain access to the data you’re after. Plus, most APIs are user-friendly, allowing you to specify what data you want using body or query parameters. You get exactly what you want and get it in a format that’s easy for humans to understand. Moreover, these APIs can even provide URLs and other valuable information for your web crawling projects.
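Here is a minimal sketch of calling such an endpoint directly from Node. The URL and query parameters are hypothetical; replace them with the real request you spotted under the XHR tab:

const https = require('https');

const apiUrl = 'https://www.example.com/api/search?q=crawlbase&page=1'; // hypothetical endpoint

https
  .get(apiUrl, (response) => {
    let body = '';
    response
      .on('data', (chunk) => (body += chunk))
      .on('end', () => console.log(JSON.parse(body))); // structured JSON, no HTML parsing needed
  })
  .on('error', console.error);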

So, the next time you set out on web data retrieval, don't forget to check whether there's a public API waiting to make your life a whole lot easier.

3. Conceal Your IP with Proxy Services

Now, this is the rule of thumb for a successful web data retrieval: never expose your real IP address while scraping. It’s one of the fundamental web scraping best practices. The reason is simple – you don’t want anti-scraping systems to pinpoint your actual IP and block you.

So, how do you stay incognito? Here is a two-word answer: Proxy services. When your scraper sends a request to a website, the proxy server’s IP shows up in the server’s logs, not yours.

The best part is that premium proxy services often offer IP rotation. This means your scraper can constantly switch between different IP addresses. It makes it incredibly challenging for websites to ban your IP because it’s a moving target.
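Here is a rough sketch of rotating requests through a pool of proxies. The proxy addresses are placeholders, and the https-proxy-agent package is one common way to do this in Node (the import style differs slightly between versions of that package, so check its README):

const https = require('https');
const { HttpsProxyAgent } = require('https-proxy-agent');

const proxies = [
  'http://user:pass@proxy1.example.com:8080', // placeholder proxy addresses
  'http://user:pass@proxy2.example.com:8080',
];

function fetchThroughProxy(targetUrl) {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)]; // pick a random proxy each time
  https
    .get(targetUrl, { agent: new HttpsProxyAgent(proxy) }, (response) => {
      let body = '';
      response
        .on('data', (chunk) => (body += chunk))
        .on('end', () => console.log(body.length, 'bytes received'));
    })
    .on('error', console.error);
}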

So, remember, when you’re scraping the web, think of proxy services as your basic requirement. They help you in data scraping without revealing your true identity.

4. Introduce Randomness in Your Crawling Pattern

A random crawling pattern is one of the best crawling tactics for safe data scraping and for keeping you safe from anti-scraping technologies. Some websites employ advanced anti-scraping techniques that analyze user behavior to distinguish between humans and bots. They look for patterns, and here's the truth: humans are known for their unpredictability.

To outsmart these watchful anti-scraping websites, you must make your web scraper behave like a human user. How do you do that? By introducing a touch of randomness into your web scraping logic.

Here are a few clever moves:

Random Offset: When your scraper scrolls or clicks, throw in some randomness. Humans don’t move with robotic precision, and neither should your scraper.
Mouse Movements: Mimic the organic movement of a human cursor. A few wiggles here and there can go a long way in blending in.
Click on Random Links: Humans are curious creatures and click on all sorts of links. Encourage your scraper to do the same.

By doing these things, your web scraper appears more human-like in the eyes of anti-scraping technologies. Give your scraper a virtual personality, making it less likely to raise suspicions.
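Here is a minimal sketch of what a couple of these moves might look like with Puppeteer, assuming you drive a headless browser; the URL, offsets, and coordinates are only illustrative:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example.com'); // placeholder URL

  // Scroll by a random offset instead of a fixed amount
  await page.evaluate((offset) => window.scrollBy(0, offset), 300 + Math.floor(Math.random() * 400));

  // Wander the mouse a little before interacting
  await page.mouse.move(100 + Math.random() * 500, 100 + Math.random() * 300, { steps: 10 });

  await browser.close();
})();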

5. Be Aware of Honeypots

Some websites employ clever tricks to hinder your scraping activities – honeypots.

Honeypot traps are hidden links strategically placed where human visitors can't see them, so only automated bots end up following them. These links are often concealed with CSS, for example by setting their display property to "none", rendering them invisible to the average user.
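Here is a rough sketch of filtering out likely honeypot links before following them, using the cheerio package. It only checks inline styles and a couple of common attributes; real honeypots can be hidden in other ways (external CSS, off-screen positioning), so treat this as a starting point:

const cheerio = require('cheerio');

function visibleLinks(html) {
  const $ = cheerio.load(html);
  return $('a')
    .toArray()
    .filter((el) => {
      const style = ($(el).attr('style') || '').replace(/\s/g, '');
      return !style.includes('display:none') && !style.includes('visibility:hidden') && !$(el).attr('hidden');
    })
    .map((el) => $(el).attr('href'));
}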

When your web scraper follows a honeypot link, it can unwittingly fall into an anti-scraping trap. The anti-scraping system watches its every move, taking notes on its behavior and gathering evidence to identify and block your scraper.

To steer clear of honeypot websites, always double-check that the website your scraper is targeting is the real deal. Don’t be lured by the promise of fake data.

Moreover, anti-bot systems also keep a watchful eye on IP addresses that have interacted with honeypot links. If your IP falls into this category, it might raise a red flag, and you could find your scraping efforts blocked.

6. Always Cache and Log Like a Pro

We know that when you collect valuable web data, you want to do it efficiently. One of the best data scraping tips is caching.

Here's how it works: whenever your scraper makes an HTTP request and receives a response, you store it in a database or log file. This raw data is more valuable than you might think. Why, you ask? Well, let's break it down:

Offline Activities: By hoarding all the HTML pages your crawler visits, you’re essentially building an offline library of web data. This means you can go back and extract data you didn’t even know you needed during your first pass. It’s like having a second chance.

Selective Storage: Now, storing entire HTML documents can be a bit heavy on disk space. So, get clever – save only the crucial HTML elements in a string format in your database. It’s all about optimizing storage without sacrificing data.

Keep a Scraping Diary: To make the best of it, your scraper should keep a log. Record the pages it visits, the time it takes to scrape each page, the outcome of data extraction operations, and more.
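As a rough illustration, here is a minimal Node.js sketch of caching raw responses to disk and keeping an append-only scraping log; the file paths and log format are just examples:

const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

const CACHE_DIR = './cache';
fs.mkdirSync(CACHE_DIR, { recursive: true });

function cacheResponse(url, html) {
  const key = crypto.createHash('md5').update(url).digest('hex'); // stable filename per URL
  fs.writeFileSync(path.join(CACHE_DIR, `${key}.html`), html);
}

function logVisit(url, durationMs, ok) {
  const entry = `${new Date().toISOString()}\t${url}\t${durationMs}ms\t${ok ? 'ok' : 'failed'}\n`;
  fs.appendFileSync('scraper.log', entry); // simple append-only scraping diary
}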

7. Outsmart CAPTCHAs with a Solving Service

Let's face it – CAPTCHAs are those guards designed to keep bots at bay. These little puzzles are easy for humans but a nightmare for machines. If you fail to solve a CAPTCHA, you risk being labeled as a bot by anti-bot systems.

Many popular Content Delivery Network (CDN) services come equipped with CAPTCHAs as part of their anti-bot defenses. So, how do you navigate this obstacle course? A CAPTCHA solving service can save you here.

CAPTCHA solving services utilize the power of human workers to tackle these riddles. These services automate the process of enlisting human help to crack CAPTCHAs. It’s like having a team of CAPTCHA-solving experts at your disposal.

For those seeking speed and efficiency, advanced web scraping APIs are available. These APIs are your shortcut to bypassing those CAPTCHA roadblocks.
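Most of these services follow a submit-and-poll flow. Here is a rough sketch of that flow in Node 18+ (which ships a global fetch); the endpoint, parameters, and response shape are entirely hypothetical, so follow your provider's own API documentation:

async function solveCaptcha(imageBase64, apiKey) {
  // Submit the CAPTCHA image to the (hypothetical) solving service
  const submit = await fetch('https://captcha-solver.example.com/submit', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ key: apiKey, image: imageBase64 }),
  });
  const { taskId } = await submit.json();

  // Poll until a human worker has typed in the answer
  for (let attempt = 0; attempt < 20; attempt++) {
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5 seconds between polls
    const result = await fetch(`https://captcha-solver.example.com/result/${taskId}`);
    const data = await result.json();
    if (data.solved) return data.text;
  }
  throw new Error('CAPTCHA not solved in time');
}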

8. Stay on the Right Side of the Law

We can't finish the list of data scraping tips without mentioning the legality of the web data retrieval process. It's essential that you don't step on any legal toes. In other words, you're responsible for what you scrape, so always take a good look at the target website's Terms of Service.

The Terms of Service spell out the dos and don'ts of data scraping from that particular website, telling you what's fair, what's off-limits, and how to scrape responsibly.

Most of the time, you won’t have permission to republish scraped data elsewhere due to copyright restrictions. Ignoring these rules can land you in legal turmoil, and trust me, you want to steer clear of that.

Wrap Up!

Follow these advanced crawling tactics for web data retrieval to ensure a smooth web scraping process. Scrolling websites and infinite scroll pages demand finesse, and a live crawler like the one we built above solves that problem.

Happy scraping!