With over 2.7 billion monthly active users as of the second quarter of 2020, Facebook is the biggest social network worldwide. There are also a whopping 620 million groups, according to Google at the time of this writing. Just imagine how much data you could gather from these groups and put to use in your project or business.

Here at ProxyCrawl, we care about data. Our whole team loves the freedom that the internet provides, and we believe that if something is publicly available, then everyone has the right to see it. However, we also respect privacy, so in this article we will focus on creating a simple scraper using PyCharm with a Python 2 interpreter, which you can then use to crawl your targeted public groups.

Preparation

Before we jump into the actual coding, there are a few things that we need to set up.

First, create a new project in PyCharm and name it proxycrawl.py. Once done, right-click the project and create a new Python file named facebookscraper, as shown in the image below:

Second, let’s make sure that we are using the Python 2 interpreter. Press Ctrl + Alt + S (on Windows) to open Settings and select the interpreter:

Scraping Facebook Groups

Now that we have successfully set up our file, it’s time to write the code. We’ll just do the very basic for now, so this will be short.

To start, we will need to import our modules:

from urllib2 import urlopen
from urllib import quote_plus
import json
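
The urllib2 module exists only on Python 2. As a rough sketch for readers whose interpreter is Python 3, the same names can be imported from their newer locations. The group URL below is just the placeholder used in this article, and the variable name encoded is introduced here for illustration.

```python
# Compatibility sketch: resolve the same names on Python 3 or Python 2.
try:
    # Python 3: urllib2 and urllib were reorganised into urllib.request and urllib.parse
    from urllib.request import urlopen
    from urllib.parse import quote_plus
except ImportError:
    # Python 2: the module layout used in this article
    from urllib2 import urlopen
    from urllib import quote_plus
import json

# quote_plus percent-encodes characters that are not safe inside a query string
encoded = quote_plus('https://www.facebook.com/PUBLIC_FACEBOOK_GROUP')
print(encoded)
```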

Next, we will pass the URL for scraping. It is important to know that when scraping Facebook, we will need to use our JavaScript token, as well as the following parameters:

&autoparse=true: this allows us to get the scraped data of the requested page.

&scroll=true: when using the JavaScript token, this parameter allows the API to scroll the page with a scroll interval of 10 seconds.

url = quote_plus('https://www.facebook.com/PUBLIC_FACEBOOK_GROUP')

handler = urlopen('https://api.proxycrawl.com/?token=YOUR_JS_TOKEN&format=json&autoparse=true&scroll=true&url=' + url)
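
The same request URL could also be assembled from a list of parameters instead of one long string, which makes it easier to see which options are being sent. This is only a sketch: build_request_url and API_ENDPOINT are hypothetical names introduced here, and the parameters are exactly those described above. No request is sent.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Endpoint taken from the request shown in this article
API_ENDPOINT = 'https://api.proxycrawl.com/'


def build_request_url(token, target_url):
    """Assemble the API request URL; urlencode escapes each value."""
    params = [
        ('token', token),        # the JavaScript token
        ('format', 'json'),
        ('autoparse', 'true'),   # return the scraped data of the page
        ('scroll', 'true'),      # let the API scroll the page
        ('url', target_url),     # encoded by urlencode, so no quote_plus call is needed
    ]
    return API_ENDPOINT + '?' + urlencode(params)


request_url = build_request_url('YOUR_JS_TOKEN',
                                'https://www.facebook.com/PUBLIC_FACEBOOK_GROUP')
print(request_url)
```

Passing the parameters as a list of pairs keeps them in the same order as the hand-written URL, and the resulting string can be given to urlopen just like before.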

For the last part of our code, we just need to print the response in a readable format. The full code should now look like this:

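from urllib2 import urlopen
from urllib import quote_plus
import json

# URL-encode the public group URL so it can be passed as a query parameter
url = quote_plus('https://www.facebook.com/groups/198722650913932')

# Request the page through the API with the JavaScript token, autoparse, and scroll enabled
handler = urlopen('https://api.proxycrawl.com/?token=YOUR_JS_TOKEN&format=json&autoparse=true&scroll=true&url=' + url)

# Parse the JSON response and pretty-print its body
pretty_json = json.loads(handler.read())
print json.dumps(pretty_json['body'], indent=4)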

To run the code, simply press Shift + F10 (on Windows) and you should get something similar to the output below:

There you have it; the code is ready and you can apply this to any of your projects. Remember that you are free to use our Python Library as well.
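
When applying this to a project, it may help to wrap the parsing step in small functions that tolerate a malformed response or a missing body instead of raising an error. The sketch below assumes only what the script above relies on, namely that the response is JSON with a body key. The function names and the sample payload (including its title and members fields) are hypothetical and exist only so the example runs without a network call.

```python
import json


def extract_body(raw_response):
    """Parse the API's JSON text and return its 'body' field, or None if unavailable."""
    try:
        data = json.loads(raw_response)
    except ValueError:
        # json.loads raises ValueError (JSONDecodeError on Python 3) for malformed input
        return None
    if not isinstance(data, dict):
        return None
    return data.get('body')


def format_body(body):
    """Pretty-print the parsed body with the same indentation as the script above."""
    return json.dumps(body, indent=4)


# Hypothetical sample payload standing in for handler.read()
sample = '{"body": {"title": "Example group", "members": 42}}'
body = extract_body(sample)
if body is not None:
    print(format_body(body))
```

In the real script, the string returned by handler.read() would take the place of sample.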

Facebook is known to be one of the hardest websites to crawl, so if you encounter any issues, just send us a message and our ProxyCrawl support team will be happy to assist.

Enjoy crawling!