Do you want to build your own web crawler with Scrapy and Python? This guide walks you through each step needed to set up and use Scrapy for SEO.
You might already have done a little web scraping with Python libraries like BeautifulSoup, requests and Requests-HTML. However, building a simple scraper is a lot less complicated than building a web crawler.
What is Scrapy?
Scrapy is a free and open-source web crawling framework written in Python.
BeautifulSoup vs Scrapy?
BeautifulSoup is incredible for simple web scraping when you know which pages you want to crawl. It is simple and easy to learn. However, when it comes to building more complex web crawlers, Scrapy is much better.
Indeed, web crawlers are a lot more complex than they seem. They need to handle errors and redirects, evaluate links on a website and cover thousands and thousands of pages.
Building a web crawler with BeautifulSoup will quickly become very complex and prone to errors. The Scrapy Python framework handles that complexity for you.
Scrapy Now Works With Python 2 and Python 3
Scrapy took a while to add support for Python 3, but it is here now. This tutorial will show you how to work with Scrapy in Python 3. However, if you still want to use Python 2 with Scrapy, just go to the appendix at the end of this post: Use Scrapy with Python 2.
Basic Python Set-Up
Install Python
If you haven’t already, install Python on your computer. For detailed steps, read how to install Python using Anaconda.
Create a Project in Github and VSCode (Optional)
For this tutorial, I will use VSCode and Github to manage my project. I recommend that you do the same. If you don’t know how to use Github and VSCode, you can follow these simple tutorials.

One of the many reasons why you will want to use VSCode is that it is super simple to switch between Python versions.
Here are the basic steps (follow the guides above for the details).
First, go to Github and create a Scrapy repository. Copy the clone URL. Next, press Command + Shift + P and type Git: Clone. Paste the clone URL from the Github repo. Once the repository is cloned, go to File > Save Workspace As and save your workspace.
Install Scrapy and Dependencies
You can download Scrapy and the documentation on Scrapy.org. You will also find information on how to install Scrapy with pip or with Anaconda. Let’s install it with pip.
To install Scrapy in VSCode, go to View > Terminal.
Add the following command in the terminal (without the $ sign).
$ pip install scrapy
Or with Anaconda.
$ conda install -c conda-forge scrapy
You will also need other packages for Scrapy to run properly.
$ pip install pyOpenSSL
$ pip install lxml
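To confirm that the install worked, you can print the installed version from the terminal.
$ scrapy version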
Test Scrapy
Scrapy lets you fetch a URL to test the server response using scrapy shell in the terminal. I recommend you start by testing the website you want to crawl, to see if there is some kind of problem.
$ scrapy shell
To get help using the shell, use shelp().
To exit the Scrapy shell, use CTRL + Z or type exit(). Then, fetch the URL to see how the server responds.
fetch('https://ca.indeed.com')

As you can see, the server returns a 200 status code.
Analyze Response
To analyze the server response, here are a few useful functions.
Read the response
view(response)
Scrapy will open the page for you in a new browser window.

Get Status Code
response.status

View the raw HTML of the page
print(response.text)
Get the Title of the page using Xpath
response.xpath('//title/text()').get()

Get H1s of the page using Xpath
response.xpath('//h1/text()').getall()

Indeed has no H1 on the homepage. I guess that says a lot about the importance of the H1 (or of their homepage).
Extract data using CSS Selectors
Here, I’d like to know what the most important keywords are for Indeed.
response.css('.jobsearch-PopularSearchesPillBoxes-pillBoxText::text').extract()

Print Response Header
from pprint import pprint
pprint(response.headers)
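As a quick sketch, you can also pull the meta description from the shell with the same XPath we will use in the spider later on; the selector simply targets the standard meta name="description" tag.
response.xpath('//meta[@name="description"]/@content').get()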
Start a Project in Scrapy
We will now create a new project. First, exit the shell by typing exit().
In the terminal, add the following command.
$ scrapy startproject indeed
We call Scrapy using the scrapy command. startproject will initialize a new directory with the name of the project you give it, in our case indeed. Files like __init__.py will be added by default to the newly created indeed directory.
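The project layout created by startproject should look roughly like this:
indeed/
    scrapy.cfg
    indeed/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py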

Understand Default Files Created
One folder and four files are created here. The spiders folder will contain your spiders as you create them. The files created are items.py, pipelines.py, settings.py and middlewares.py.
items.py contains the elements that you want to scrape from a page: URL, title, meta description, etc.
pipelines.py contains the code that tells the crawler what to do with the scraped data: cleaning HTML, validating scraped data, dropping duplicates, storing the scraped item in a database.
settings.py contains the settings of the crawler, such as the crawl delay.
middlewares.py is where you can set up proxies when crawling a website.
scrapy.cfg is the configuration file and __init__.py is the initialization file.
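To give you an idea of how items.py and pipelines.py fit together, here is a minimal sketch. The field names and the DedupePipeline class are examples I made up for illustration, not something Scrapy generates for you, and the pipeline would still need to be enabled under ITEM_PIPELINES in settings.py.
# items.py - example item describing the fields we want to scrape
import scrapy

class PageItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    h1 = scrapy.Field()
    description = scrapy.Field()

# pipelines.py - example pipeline that drops pages we have already seen
from scrapy.exceptions import DropItem

class DedupePipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen_urls:
            raise DropItem("Duplicate page: " + item['url'])
        self.seen_urls.add(item['url'])
        return item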
Create Your First Crawler
You can now create your first crawler by accessing the newly created directory and running genspider.
Create the Template File
In the terminal, access the directory.
cd indeed
Create a template file
scrapy genspider indeedCA ca.indeed.com
This will create a spider called ‘indeedCA’ using the ‘basic’ template from Scrapy. A file named indeedCA.py is now available in the spiders folder and contains this basic information:
# -*- coding: utf-8 -*-
import scrapy


class IndeedSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['indeed.com']
    start_urls = ['http://indeed.com/']

    def parse(self, response):
        pass
I will modify it a little to crawl for Python Jobs.

Extract Data From the Page
The parse() function in the class is used to extract data from the HTML document. In this tutorial, I will extract the H1, the title and the meta description.
# -*- coding: utf-8 -*-
import scrapy


class IndeedcaSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['ca.indeed.com']
    start_urls = ['https://ca.indeed.com/Python-jobs']

    def parse(self, response):
        print("Fetching ... " + response.url)
        # Extract data
        h1 = response.xpath('//h1/text()').extract()
        title = response.xpath('//title/text()').extract()
        description = response.xpath('//meta[@name="description"]/@content').extract()
        # Return in a combined list
        data = zip(h1, title, description)
        for value in data:
            # Create a dictionary to store the scraped data
            scraped_data = {
                # key: value
                'url': response.url,
                'h1': value[0],
                'title': value[1],
                'description': value[2],
            }
            yield scraped_data
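If you later want the spider to crawl beyond that single page, one common pattern is to follow the pagination link at the end of parse(). The CSS selector below is only an assumption about Indeed’s markup; check the real page before relying on it.
# At the end of parse(), follow the "next page" link if there is one.
next_page = response.css('a[aria-label="Next"]::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)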
Export Your Data
To export your data to a CSV file, go to the settings.py file and add these lines.
FEED_FORMAT = 'csv'
FEED_URI = 'indeed.csv'
Or you can save it to JSON.
FEED_FORMAT = 'json'
FEED_URI = 'indeed.json'
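Alternatively, you can leave settings.py alone and pass an output file directly on the command line; Scrapy infers the format from the file extension.
$ scrapy crawl indeedCA -o indeed.csv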
Run the Spider
Running the spider is super simple. Just run this line in the terminal.
$ scrapy crawl indeedCA
This should return an indeed.csv file with the data extracted from https://ca.indeed.com/Python-jobs.

Make Your Web Crawler Respectful
Don’t Crawl Too Fast
Scrapy web crawlers are really fast. This means that they fetch a large number of pages in a short amount of time, which puts a heavy load on servers.
If a web server can handle 5 requests per second and a web crawler fetches 100 pages per second, or fires 100 concurrent requests (threads), the crawler can easily overwhelm the server.
Crawl Delay
To avoid hitting the web servers too frequently, use the DOWNLOAD_DELAY setting in your settings.py file.
DOWNLOAD_DELAY = 3
This will add a 3-second delay between requests.
Concurrent Requests (threads)
Most servers can handle 4-5 concurrent requests. If crawling speed is not an issue for you, consider reducing to only a single thread.
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AutoThrottle
Each website varies in the number of requests it can handle. AutoThrottle handles this for you: it automatically adjusts the delay between requests depending on the current web server load.
AUTOTHROTTLE_ENABLED = True
Identify Yourself
Whenever you hit a server, you leave a trace. Website owners can identify you using your IP address, but also with your user agent.
By identifying yourself as a bot, you help website owners understand your intentions.
USER_AGENT = 'webcrawlerbot (+http://www.yourdomain.com)'
Robots.txt
Websites create a robots.txt file to make sure that some pages are not crawled by bots. You could decide not to follow the robots.txt directives; however, you should follow them, both ethically and legally.
ROBOTSTXT_OBEY = True
Use Cache
By building your own web crawler, you will try many things. Sometimes, you will request the same page over and over just to fine-tune your scraper.
Scrapy has built-in functionality that lets you cache pages. This way, when you fetch the page for the 2nd, 3rd or 99th time in a minute, it will use the cached version of the page instead of sending a new request.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 500
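Putting it all together, the politeness section of your settings.py might look something like this. The values are just reasonable starting points, not official recommendations.
# settings.py - example "respectful crawler" settings
USER_AGENT = 'webcrawlerbot (+http://www.yourdomain.com)'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 500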
Conclusion
Scrapy is the most powerful open-source Python framework for building a web crawler. We have only scratched the surface, and it may still seem confusing to use.
If this is the case, I suggest that you start with something simpler and learn how to use requests and BeautifulSoup together for web scraping.
