Scrapy for SEO (Python)

Do you want to build your own web crawler with Scrapy and Python? This guide will walk you through each step to set up and use Scrapy for SEO.

You might already have learned a little web scraping with Python libraries like BeautifulSoup and Requests-HTML. However, building a simple scraper is a lot less complicated than building a web crawler.

What is Scrapy?

Scrapy is a free and open-source web crawling framework written in Python.


BeautifulSoup or Scrapy?

BeautifulSoup is incredible for simple Web Scraping when you know which pages you want to crawl. It is simple and easy to learn. However, when it comes to building more complex web crawlers, Scrapy is much better.

Indeed, web crawlers are a lot more complex than they seem. They need to handle errors and redirects, evaluate links on a website, and crawl thousands and thousands of pages.

Building a web crawler with BeautifulSoup will quickly become very complex and error-prone. The Scrapy library handles that complexity for you.

Scrapy Now Works With Python 2 and Python 3

Scrapy took a while to support Python 3, but it is here now. This tutorial will show you how to work with Scrapy in Python 3. However, if you still want to use Python 2 with Scrapy, just go to the appendix at the end of this post: Use Scrapy with Python 2.

Basic Python Set-Up

Install Python

If you haven’t already, install Python on your computer. For detailed steps, read how to install Python using Anaconda.
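To check that Python is installed and see which version you have, you can run this in the terminal (depending on your setup, the command may be python3 instead of python).

$ python --version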

Create a Project in Github and VSCode (Optional)

For this tutorial, I will use VSCode and Github to manage my project. I recommend that you do the same. If you don’t know how to use Github and VSCode, you can follow these simple tutorials.

Create a project in VSCode

One of the many reasons to use VSCode is that it is super simple to switch between Python versions.

Here are the simple steps (follow guides above for detailed steps).

First, go to Github and create a Scrapy repository. Copy the clone URL. Next, press Command + Shift + P and type Git: Clone. Paste the clone URL from the Github Repo. Once the repository is cloned, go to File > Save Workspace as and save your workspace.
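If you prefer working from the terminal instead of the Command Palette, the clone step looks roughly like this (the URL below is a placeholder; use the clone URL of your own Github repository).

$ git clone https://github.com/your-username/your-scrapy-repo.git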


Install Scrapy and Dependencies

You can download Scrapy and the documentation on Scrapy.org. You will also find information on how to download Scrapy with Pip and Scrapy with Anaconda. Let’s install it with pip.

To install Scrapy in VSCode, go to View > Terminal.

Add the following command in the terminal (without the $ sign).

$ pip install scrapy

Or with Anaconda.

$ conda install -c conda-forge scrapy

You will also need a couple of other packages for Scrapy to run properly.

$ pip install pyOpenSSL
$ pip install lxml

Test Scrapy

Scrapy lets you fetch a URL to test the server response using scrapy shell in the terminal. I recommend you start by testing the website you want to crawl to see if there is some kind of problem.

$ scrapy shell

To get help using the shell, use shelp().

To exit the Scrapy shell, use Ctrl-D (Ctrl-Z on Windows) or type exit().

Then, fetch the URL to get the server response. Note that fetch() runs inside the Scrapy shell, not in the terminal.

fetch('https://ca.indeed.com')

As you can see, the server returns a 200 status code.
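Note that you can also open the shell and fetch the page in a single command, which should be equivalent.

$ scrapy shell 'https://ca.indeed.com'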

Analyze Response

To analyze the server response, here are a few useful functions.

Read the response

view(response)

Scrapy will open the page for you in a new browser window.

Get Status Code

response.status

View the raw HTML of the page

print(response.text)

Get the Title of the page using Xpath

response.xpath('//title/text()').get()

Get H1s of the page using Xpath

response.xpath('//h1/text()').getall()

Indeed has no H1 on the homepage. I guess that says a lot about the importance of the H1 (or of their homepage).

Extract data using CSS Selectors

Here, I’d like to know what the most important keywords are for Indeed.

response.css('.jobsearch-PopularSearchesPillBoxes-pillBoxText::text').extract()   

Print Response Header

from pprint import pprint
pprint(response.headers)

Start a Project in Scrapy

We will now create a new project. First, exit the Shell by typing exit().

In the terminal, add the following command.

$ scrapy startproject indeed

We call Scrapy using the scrapy command. The startproject command will initialize a new directory with the name of the project you give it, in our case indeed. Files like __init__.py will be added by default to the newly created crawler directory.

Start a project with Scrapy

Understand Default Files Created

One folder and four files are created here. The spiders folder will contain your spiders as you create them. The files created are items.py, pipelines.py, settings.py and middlewares.py.
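Your project directory should now look roughly like this (the exact layout may vary slightly between Scrapy versions).

indeed/
    scrapy.cfg
    indeed/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py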

  • items.py contains the elements that you want to scrape from a page: url, title, meta, etc. (a minimal example is shown below).
  • pipelines.py contains the code that tells Scrapy what to do with the scraped data: cleaning HTML data, validating scraped data, dropping duplicates, storing the scraped items in a database.
  • settings.py contains the settings of the crawler, such as the crawl delay.
  • middlewares.py is where you can set up proxies when crawling a website.

scrapy.cfg is the configuration file and __init__.py is the initialization file.
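As an illustration of what goes into items.py, here is a minimal sketch of an Item holding the fields we will scrape later in this tutorial. The class name and fields are my own choices, not something the startproject command generates for you.

import scrapy


class IndeedItem(scrapy.Item):
    # Fields we plan to scrape from each page
    url = scrapy.Field()
    h1 = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()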

Create Your First Crawler

You can now create your first crawler by accessing the newly created directory and running genspider.

Create the Template File

In the terminal, access the directory.

cd indeed

Create a template file

scrapy genspider indeedCA ca.indeed.com

This will create a spider called ‘indeedCA’ using a ‘basic’ template from Scrapy. A file named indeedCA.py is now available in the spiders folder and contains this basic information:

# -*- coding: utf-8 -*-
import scrapy


class IndeedSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['indeed.com']
    start_urls = ['http://indeed.com/']

    def parse(self, response):
        pass

I will modify it a little to crawl for Python Jobs.

Extract Data From the Page

The parse() function in the class is used to extract data from the HTML document. In this tutorial, I will extract the H1, the title and the meta description.

# -*- coding: utf-8 -*-
import scrapy


class IndeedcaSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['ca.indeed.com']
    start_urls = ['https://ca.indeed.com/Python-jobs']

    def parse(self, response):

        print("Fetching ... " + response.url)

        #Extract data
        h1 = response.xpath('//h1/text()').extract()
        title = response.xpath('//title/text()').extract()
        description = response.xpath('//meta[@name="description"]/@content').extract()
        
        # Return in a combined list
        data = zip(h1,title,description)

        for value in data:
            # create a dictionary to store the scraped data
            scraped_data = {
                #key:value
                'url' : response.url,
                'h1' : value[0], 
                'title' : value[1],
                'description' : value[2]
            }

            yield scraped_data
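One thing to keep in mind: zip() only yields items when all three lists have values, so a page with no H1 or no meta description would be skipped entirely. If you prefer one item per page regardless, here is a defensive variant of the parse() method (my own adaptation, not part of the original spider) that falls back to None for missing elements.

    def parse(self, response):
        # Drop-in replacement for parse() in IndeedcaSpider above:
        # .get() returns the first match, or None when the element is missing
        yield {
            'url': response.url,
            'h1': response.xpath('//h1/text()').get(),
            'title': response.xpath('//title/text()').get(),
            'description': response.xpath('//meta[@name="description"]/@content').get(),
        }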

Export Your Data

To export your data to a CSV file, go into the settings.py file and add these lines.

FEED_FORMAT = 'csv'
FEED_URI = 'indeed.csv'

Or you can save it to JSON.

FEED_FORMAT = 'json'
FEED_URI = 'indeed.json'
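If you are running a recent version of Scrapy (2.1 or later), FEED_FORMAT and FEED_URI are deprecated in favour of the FEEDS setting; the equivalent configuration would look something like this.

FEEDS = {
    'indeed.csv': {'format': 'csv'},
}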

Run the Spider

Running the spider is super simple. Just run this line in the terminal.

$ scrapy crawl indeedCA

This should return an indeed.csv file with the extracted data from https://ca.indeed.com/Python-jobs.
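Alternatively, you can skip the FEED settings entirely and pass the output file directly on the command line with Scrapy’s -o option.

$ scrapy crawl indeedCA -o indeed.csv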

Make Your Web Crawler Respectful

Don’t Crawl Too Fast

Scrapy web crawlers are really fast. This means that they fetch a large number of pages in a short amount of time, which puts a heavy load on servers.


If a web server can handle 5 requests per second and your crawler fetches 100 pages per second, or sends 100 concurrent requests (threads), you risk slowing the site down for real users, or even taking it offline.

Crawl Delay

To avoid hitting the web servers too frequently, use the DOWNLOAD_DELAY setting in your settings.py file.

DOWNLOAD_DELAY = 3

This will add a 3-second delay between requests.

Concurrent Requests (threads)

Most servers can handle 4-5 concurrent requests. If crawling speed is not an issue for you, consider reducing this further, down to a single thread.

CONCURRENT_REQUESTS_PER_DOMAIN = 4

AutoThrottle

Each website varies in the number of requests it can handle. AutoThrottle is the perfect setting for this: it automatically adjusts the delay between requests depending on the current load on the web server.

AUTOTHROTTLE_ENABLED = True
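If you want to tune AutoThrottle further, Scrapy exposes a few related settings; the values below are its documented defaults and are only a starting point, not recommendations specific to this tutorial.

AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0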

Identify Yourself

Whenever you hit a server, you leave a trace. Website owners can identify you using your IP address, but also with your user agent.

By identifying yourself as a bot, you help website owners understand your intentions.

USER_AGENT = 'webcrawlerbot (+http://www.yourdomain.com)'

Robots.txt

Websites create a robots.txt file to make sure that some pages are not crawled by bots. You could decide not to follow the robots.txt directives, but you should follow them, both for ethical and legal reasons.

ROBOTSTXT_OBEY = True

Use Cache

By building your own web crawler, you will try many things. Sometimes, you will request the same page over and over just to fine-tune your scraper.

Scrapy has a built-in feature that lets you cache pages. This way, when you fetch the page for the 2nd, 3rd or 99th time in a minute, it will use the cached version of the page instead of sending a new request.

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 500

Conclusion

Scrapy is the best and most powerful open-source library in Python for building a web crawler. We have just scratched the surface, and it may still seem confusing to use.

If this is the case, I suggest that you start with something simpler and learn how to use requests and BeautifulSoup together for web scraping.