Do you want to build your own web crawler with Scrapy and Python? This guide will walk you through each step to set up and use Scrapy for SEO.
You might already have learned a little web scraping with Python libraries like BeautifulSoup, Requests and Requests-HTML. However, building a simple scraper is a lot less complicated than building a web crawler.
What is Scrapy?
Scrapy is a free and open-source web crawling framework written in Python.
What is Web Scraping?
Web scraping is the process of using a bot to extract data from a website and export it into a digestible format. Web scrapers extract HTML from a web page, which is then parsed to extract information.
How is Scrapy Useful in Web Scraping and Web Crawling?
The Scrapy Python framework takes care of the complexity of web crawling and web scraping by providing functions to take care of things such as recursive download, timeouts, respecting robots.txt, crawl speed, etc.
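To give you a taste, much of this behaviour is controlled through plain settings in your project's settings.py file (we come back to these later in the post). The values below are purely illustrative:
# Illustrative settings, values are examples only
ROBOTSTXT_OBEY = True        # respect robots.txt rules
DOWNLOAD_DELAY = 1           # wait 1 second between requests
DOWNLOAD_TIMEOUT = 15        # give up on slow responses after 15 seconds
RETRY_TIMES = 2              # retry failed downloads twice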
BeautifulSoup vs Scrapy
BeautifulSoup is incredible for simple web scraping when you know which pages you want to crawl. It is simple and easy to learn. However, when it comes to building more complex web crawlers, Scrapy is much better.
Indeed, web crawlers are a lot more complex than they seem. They need to handle errors and redirects, evaluate links on a website and cover thousands and thousands of pages.
Building a web crawler with BeautifulSoup will quickly become very complex and prone to errors. The Scrapy Python library handles that complexity for you.
Scrapy Now Works With Python 2 and Python 3
Scrapy took a while to support Python 3, but it is here now. This tutorial will show you how to work with Scrapy in Python 3. However, if you still want to use Python 2 with Scrapy, just go to the appendix at the end of this post: Use Scrapy with Python 2.
Basic Python Set-Up
Install Python
If you haven’t already, install Python on your computer. For detailed steps, read how to install Python using Anaconda.
Create a Project in Github and VSCode (Optional)
For this tutorial, I will use VSCode and Github to manage my project. I recommend that you do the same. If you don’t know how to use Github and VSCode, you can follow these simple tutorials.

One of the many reasons why you will want to use VSCode is that it is super simple to switch between Python versions.
Here are the simple steps (follow guides above for detailed steps).
First, go to Github and create a Scrapy repository. Copy the clone URL. Next, press Command + Shift + P and type Git: Clone. Paste the clone URL from the Github repo. Once the repository is cloned, go to File > Save Workspace As and save your workspace.
Install Scrapy and Dependencies
You can download Scrapy and the documentation on Scrapy.org. You will also find information on how to download Scrapy with Pip and Scrapy with Anaconda. Let’s install it with pip.
To install Scrapy in VSCode, go to View > Terminal.
Add the following command in the terminal (without the $ sign).
$ pip install scrapy
Or with Anaconda.
$ conda install -c conda-forge scrapy
You will also need other packages to run Scrapy properly.
$ pip install pyOpenSSL
$ pip install lxml
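If you want to make sure the installation worked, you can check the installed version from the terminal.
$ scrapy version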
How to Use the Scrapy Selector in Python
The Scrapy Selector is a wrapper around the parsel Python library that simplifies integration with Scrapy Response objects. To use the Selector object in Scrapy, import the class from the scrapy library and call Selector() with your HTML as the value of the text parameter.
You can use the Scrapy Selector to scrape any HTML.
from scrapy import Selector
# Assign custom HTML to variable
html = '''<html>
<head>
<title>Title of your web page</title>
</head>
<body>
<h1>Heading of the page</h1>
<p id="first-paragraph" class="paragraph">Paragraph of text</p>
<p class="paragraph">Paragraph of <strong>text 2</strong></p>
<div><p class="paragraph">Nested paragraph</p></div>
<a href="/a-link">hyperlink</a>
</body>
</html>'''
# Instantiate Selector
sel = Selector(text=html)
The xpath() and css() methods of the Selector return a SelectorList of Selector objects.
Scrapy Selector Methods
The most popular methods to use with a Scrapy Selector are:
- xpath()
- css()
- extract()
- extract_first()
- get()
How to Use XPath on Scrapy Selector
To use XPath to extract elements from an HTML document with Scrapy, use the xpath()
method on the Selector
object with your XPath expression as an argument.
# Use Xpath on Selector Object
sel.xpath('//p')
[<Selector xpath='//p' data='<p id="first-paragraph" class="paragr...'>,
<Selector xpath='//p' data='<p class="paragraph">Paragraph of <st...'>,
<Selector xpath='//p' data='<p class="paragraph">Nested paragraph...'>]
How to Use CSS Selector with Scrapy
To use a CSS Selector to extract elements from an HTML document with Scrapy, use the css()
method on the Selector
object with your CSS selector expression as an argument.
# Use CSS on Selector Object
sel.css('p')
[<Selector xpath='descendant-or-self::p' data='<p id="first-paragraph" class="paragr...'>,
<Selector xpath='descendant-or-self::p' data='<p class="paragraph">Paragraph of <st...'>,
<Selector xpath='descendant-or-self::p' data='<p class="paragraph">Nested paragraph...'>]
How to Apply Multiple Methods with Scrapy Selector
To apply multiple methods to a Scrapy Selector, chain methods on the selector object.
# Chaining on Selector Object
sel.xpath('//div').css('p')
[<Selector xpath='descendant-or-self::p' data='<p class="paragraph">Nested paragraph...'>]
Extract Data From Selector with Extract()
Use the extract() method to access the data within the Selectors in the SelectorList.
sel.xpath('//p').extract()
Output:
['<p id="first-paragraph" class="paragraph">Paragraph of text</p>', '<p class="paragraph">Paragraph of <strong>text 2</strong></p>', '<p class="paragraph">Nested paragraph</p>']
Get First Element from Selector with Extract_first()
To get the data of the first item in the SelectorList, use the extract_first() method.
sel.xpath('//p').extract_first()
'<p id="first-paragraph" class="paragraph">Paragraph of text</p>'
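Note that recent versions of Scrapy (and the underlying parsel library) also expose get() and getall(), which are the recommended equivalents of extract_first() and extract(). On the sample HTML above, they return the same output:
# get() returns the first match, getall() returns all matches
sel.xpath('//p').get()
sel.xpath('//p').getall()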
Web Scraping with Requests and Scrapy
You can use the Python Requests library to fetch the HTML of a web page and then use the Scrapy Selector to parse the HTML with XPath.
Below we will extract all the links on a page with Scrapy and Requests.
from scrapy import Selector
import requests
url = 'https://crawler-test.com/'
response = requests.get(url)
# Use the decoded text of the response, Selector expects a string
html = response.text
sel = Selector(text=html)
# Extract the href attribute of every link on the page
sel.xpath('//a/@href').extract()
How to Make a Scrapy Web Crawler
To make a Scrapy web crawler, create a class that inherits from scrapy.Spider, add the start_requests() method to define URLs to crawl and use a parsing method as a callback to process each page.
Create the Scrapy Spider Class
To create the scrapy spider, create a class that inherits from the scrapy.Spider object, and give it a name.
# Import scrapy library
import scrapy

# Create the spider class
class SimpleSpider(scrapy.Spider):
    name = "SimpleSpider"
    # Do something
Add the Start_Requests Method
The start_requests() method is required and takes self as its input. It defines which pages to crawl and what to do with the crawled pages. It must be named start_requests().
Then, it loops through each URL and yields a scrapy.Request() object.
Each scrapy.Request() returns a Response object that is passed to the callback.
Simply put, it takes the URL we give it and sends its response to the parse() method defined in the callback.
# Import scrapy library
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin

# Create the spider class
class SimpleSpider(scrapy.Spider):
    name = "SimpleSpider"

    # start_requests method
    def start_requests(self):
        # Urls to crawl
        urls = ["https://www.crawler-test.com"]
        for url in urls:
            # Make the request and then execute the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do something
        pass
Add a Parser to the Scrapy Spider
Add a parser to the scrapy spider so that you actually do something with the URL that you crawl.
Below, the parse() method receives the response object from the scrapy.Request() call. The name of this method has to be the same as the one given to the callback parameter of the start_requests() method.
In the parse() method below, we extract all links using a CSS selector specific to Scrapy: a::attr(href).
Then, we use urljoin() to turn relative URLs into absolute ones.
Finally, we write each URL to a file.
# Import scrapy library
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin

# Create the spider class
class SimpleSpider(scrapy.Spider):
    name = "SimpleSpider"

    # start_requests method
    def start_requests(self):
        # Urls to crawl
        urls = ["https://www.crawler-test.com"]
        for url in urls:
            # Make the request and then execute the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    # parse method
    def parse(self, response):
        # Extract Href from all links
        links = response.css('a::attr(href)').extract()
        # Create text file to add links to
        with open('links.txt', 'w') as f:
            # loop each link
            for link in links:
                # Join absolute and Relative URLs
                link = urljoin("https://www.crawler-test.com", link)
                # Write link to file
                f.write(link + '\n')
Run the Spider
Run the spider by creating a CrawlerProcess instance from the scrapy.crawler module, registering the spider with crawl() and calling start().
# Run the Spider
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(SimpleSpider)
process.start()
Full Code for the Simple Spider
Here is the full code for the steps discussed above.
# Import scrapy library
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin

# Create the spider class
class SimpleSpider(scrapy.Spider):
    name = "SimpleSpider"

    # start_requests method
    def start_requests(self):
        # Urls to crawl
        urls = ["https://www.crawler-test.com"]
        for url in urls:
            # Make the request and then execute the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    # parse method
    def parse(self, response):
        # Extract Href from all links
        links = response.css('a::attr(href)').extract()
        # Create text file to add links to
        with open('links.txt', 'w') as f:
            # loop each link
            for link in links:
                # Join absolute and Relative URLs
                link = urljoin("https://www.crawler-test.com", link)
                # Write link to file
                f.write(link + '\n')

# Run the Spider
process = CrawlerProcess()
process.crawl(SimpleSpider)
process.start()
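To run this standalone script, save it in a file, for example simple_spider.py (the filename is up to you), and launch it with Python from the terminal. The crawled links will be written to links.txt.
$ python simple_spider.py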
We will now go a step further and build a web crawler using the Scrapy shell.
How to Build a Web Crawler with the Scrapy Shell
This next section will show you how to use the Scrapy shell to crawl a website.
Scrapy lets you fetch a URL to test the server response using scrapy shell in the Terminal. I recommend you start by testing the website you want to crawl to see if there is some kind of problem.
$ scrapy shell
To get help using the shell, use shelp().
To exit the Scrapy shell, use CTRL + Z or type exit().
Then, inside the shell, fetch the URL to process the server response.
fetch('https://ca.indeed.com')

As you can see, the server returns a 200 status code.
Analyze Response
To analyze the server response, here are a few useful functions:
Read the response
view(response)
Scrapy will open the page for you in a new browser window.

Get Status Code
response.status

View the raw HTML of the page
print(response.text)
Get the Title of the page using Xpath
response.xpath('//title/text()').get()

Get H1s of the page using Xpath
response.xpath('//h1/text()').getall()

Indeed has no H1 on the homepage. I guess that says a lot about the importance of the H1 (or of their homepage).
Extract data using CSS Selectors
Here, I’d like to know what the most important keywords for Indeed are.
response.css('.jobsearch-PopularSearchesPillBoxes-pillBoxText::text').extract()

Print Response Headers
from pprint import pprint
pprint(response.headers)
Start a Project in Scrapy
We will now create a new project. First, exit the shell by typing exit().
In the terminal, add the following command.
$ scrapy startproject indeed
We call Scrapy using the scrapy command. startproject will initialize a new directory with the name of the project you give it, in our case indeed. Files like __init__.py will be added by default to the newly created project directory.

Understand Default Files Created
One folder and four files are created here. The spiders folder will contain your spiders as you create them. The files created are items.py, pipelines.py, settings.py and middlewares.py.
- items.py contains the elements that you want to scrape from a page: URL, title, meta, etc.
- pipelines.py contains the code that tells the crawler what to do with the scraped data: cleaning HTML data, validating scraped data, dropping duplicates, storing the scraped item in a database.
- settings.py contains the settings of the crawler, such as the crawl delay.
- middlewares.py is where you can set up proxies when crawling a website.
scrapy.cfg is the configuration file and __init__.py is the initialization file.
Create Your First Crawler
You can now create your first crawler by accessing the newly created directory and running genspider.
Create the Template File
In the terminal, access the directory.
$ cd indeed
Create a template file.
$ scrapy genspider indeedCA ca.indeed.com
This will create a spider called ‘indeedCA’ using a ‘basic’ template from Scrapy. A file named indeedCA.py is now accessible in the spiders folder and contains this basic information:
# -*- coding: utf-8 -*-
import scrapy

class IndeedSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['indeed.com']
    start_urls = ['http://indeed.com/']

    def parse(self, response):
        pass
This code imports scrapy and then creates a Python class that contains the code for the spider.
The Scrapy spider class is in the following format:
class SpiderName(scrapy.Spider):
    name = 'SpiderName'
    # Do something
This class inherits the methods from scrapy.Spider and tells Scrapy which websites to scrape and how to scrape them.
I will modify it a little to crawl for Python Jobs.

Extract Data From the Page
The parse()
function in the class is used to extract data from the HTML document. In this tutorial, I will extract the H1, the title and the meta description.
# -*- coding: utf-8 -*-
import scrapy

class IndeedcaSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['ca.indeed.com']
    start_urls = ['https://ca.indeed.com/Python-jobs']

    def parse(self, response):
        print("Fetching ... " + response.url)
        # Extract data
        h1 = response.xpath('//h1/text()').extract()
        title = response.xpath('//title/text()').extract()
        description = response.xpath('//meta[@name="description"]/@content').extract()
        # Return in a combined list
        data = zip(h1, title, description)
        for value in data:
            # Create a dictionary to store the scraped data
            scraped_data = {
                # key: value
                'url': response.url,
                'h1': value[0],
                'title': value[1],
                'description': value[2]
            }
            yield scraped_data
Export Your Data
To export your data to a CSV file, go to the settings.py file and add these lines.
FEED_FORMAT = 'csv'
FEED_URI = 'indeed.csv'
Or you can save it to JSON.
FEED_FORMAT = 'json'
FEED_URI = 'indeed.json'
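Note that FEED_FORMAT and FEED_URI still work in older releases but are deprecated in newer Scrapy versions, which use the FEEDS setting instead. A rough equivalent would be:
FEEDS = {
    'indeed.csv': {'format': 'csv'},
}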
Run the Spider
Running the spider is super simple. Simply run this line in the terminal.
$ scrapy crawl indeedCA
This should return an indeed.csv file with the data extracted from https://ca.indeed.com/Python-jobs.

Make Your Web Crawler Respectful
Don’t Crawl Too Fast
Scrapy web crawlers are really fast. They fetch a large number of pages in a short amount of time, which puts a heavy load on web servers.
If a web server can only handle 5 requests per second and your crawler sends 100 requests per second across 100 concurrent threads, the server will struggle to keep up.
Crawl Delay
To avoid hitting the web servers too frequently, use the DOWNLOAD_DELAY setting in your settings.py file.
DOWNLOAD_DELAY = 3
This will add a 3-second delay between requests.
Concurrent Requests (threads)
Most servers can handle 4-5 concurrent requests. If crawling speed is not an issue for you, consider limiting the number of concurrent requests per domain, or even reducing it to a single thread.
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AutoThrottle
Each website varies in the number of requests it can handle. AutoThrottle is a useful setting for this: it automatically adjusts the delay between requests depending on the current load on the web server.
AUTOTHROTTLE_ENABLED = True
Identify Yourself
Whenever you hit a server, you leave a trace. Website owners can identify you using your IP address, but also with your user agent.
By identifying yourself as a bot, you help website owners understand your intentions.
USER_AGENT = 'webcrawlerbot (+http://www.yourdomain.com)'
Robots.txt
Websites create robots.txt files to make sure that some pages are not crawled by bots. You could decide not to follow the robots.txt directives, but you should, both ethically and legally.
ROBOTSTXT_OBEY = True
Use Cache
By building your own web crawler, you will try many things. Sometimes, you will request the same page over and over just to fine-tune your scraper.
Scrapy has built-in functionality that lets you cache pages. This way, when you fetch the same page a 2nd, 3rd or 99th time in a minute, it will use the cached version of the page instead of sending a new request.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 500
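Putting it all together, a polite settings.py for this crawler could look something like this. The values are the examples from this post, to adapt to the website you crawl, and the user agent URL is a placeholder.
# Example polite crawler settings (adjust values to your use case)
USER_AGENT = 'webcrawlerbot (+http://www.yourdomain.com)'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 500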
Conclusion
Scrapy is a powerful open-source Python framework for building web crawlers. We have only scratched the surface, and it may still seem confusing to use.
If this is the case, I suggest that you start with something simpler and learn how to use requests and BeautifulSoup together for web scraping.
