Web Scraping with Scrapy in Python (Crawler Example)

Do you want to build your own web crawler with Scrapy and Python? This guide will walk you through each step needed to set up and use Scrapy for SEO.

You might already have done a little web scraping with Python libraries like BeautifulSoup, requests, and Requests-HTML. However, building a simple scraper is a lot less complicated than building a web crawler.

What is Scrapy?

Scrapy is a free and open-source web crawling framework written in Python.


What is Web Scraping?

Web scraping is the process of using a bot to extract data from a website and export it into a digestible format. Web scrapers extract HTML from a web page, which is then parsed to extract information.

How is Scrapy useful in Web Scraping and Web Crawling?

The Scrapy Python framework takes care of the complexity of web crawling and web scraping by providing built-in handling for things such as recursive downloads, timeouts, respecting robots.txt, crawl speed, and more.

BeautifulSoup vs Scrapy?

BeautifulSoup is incredible for simple Web Scraping when you know which pages you want to crawl. It is simple and easy to learn. However, when it comes to building more complex web crawlers, Scrapy is much better.

Indeed, web crawlers are a lot more complex than they seem. They need to handle errors and redirects, evaluate links on a website and cover thousands and thousands of pages.

Building a web crawler with BeautifulSoup quickly becomes complex and prone to errors. The Scrapy Python framework handles that complexity for you.

Scrapy Now Works With Python 2 and Python 3

Scrapy took a while to support Python 3, but it is here now. This tutorial shows you how to work with Scrapy in Python 3. However, if you still want to use Python 2 with Scrapy, go to the appendix at the end of this post: Use Scrapy with Python 2.

Basic Python Set-Up

Install Python

If you haven’t already, install Python on your computer. For detailed steps, read how to install Python using Anaconda.

Create a Project in Github and VSCode (Optional)

For this tutorial, I will use VSCode and Github to manage my project. I recommend that you do the same. If you don’t know how to use Github and VSCode, you can follow these simple tutorials.

Create a project in VSCode

One of the many reasons why you will want to use VSCode is that it is super simple to switch between Python versions.

Here are the simple steps (follow guides above for detailed steps).

First, go to Github and create a Scrapy repository. Copy the clone URL. Next, press Command + Shift + P and type Git: Clone. Paste the clone URL from the Github Repo. Once the repository is cloned, go to File > Save Workspace as and save your workspace.

Install Scrapy and Dependencies

You can download Scrapy and the documentation on Scrapy.org. You will also find information on how to download Scrapy with Pip and Scrapy with Anaconda. Let’s install it with pip.

To install Scrapy in VSCode, go to View > Terminal.

Add the following command in the terminal (without the $ sign).

$ pip install scrapy

Or with Anaconda.

$ conda install -c conda-forge scrapy

You will also need a few other packages for Scrapy to run properly.

$ pip install pyOpenSSL
$ pip install lxml

How to use the Scrapy Selector in Python

The Scrapy Selector is a wrapper of the parsel Python library that simplifies the integration of Scrapy Response objects. To use the Selector object in Scrapy, import the class from the scrapy library and call the Selector() object with your HTML as the value of the text parameter.

You can use the Scrapy Selector to scrape any HTML.

from scrapy import Selector

# Assign custom HTML to variable
html = '''<html>
    <head>
        <title>Title of your web page</title>
    </head>
    <body>
        <h1>Heading of the page</h1>
        <p id="first-paragraph" class="paragraph">Paragraph of text</p>
        <p class="paragraph">Paragraph of <strong>text 2</strong></p>
        <div><p class="paragraph">Nested paragraph</p></div>
        <a href="/a-link">hyperlink</a>
    </body>
</html>'''

# Instantiate Selector
sel = Selector(text=html)

The xpath() and css() methods return a SelectorList of Selector objects.

Scrapy Selector Methods

The most popular methods to use with a Scrapy Selector are:

  • xpath()
  • css()
  • extract()
  • extract_first()
  • get()

How to Use XPath on Scrapy Selector

To use XPath to extract elements from an HTML document with Scrapy, use the xpath() method on the Selector object with your XPath expression as an argument.

# Use Xpath on Selector Object
sel.xpath('//p')
[<Selector xpath='//p' data='<p id="first-paragraph" class="paragr...'>,
 <Selector xpath='//p' data='<p class="paragraph">Paragraph of <st...'>,
 <Selector xpath='//p' data='<p class="paragraph">Nested paragraph...'>]

How to Use CSS Selector with Scrapy

To use a CSS Selector to extract elements from an HTML document with Scrapy, use the css() method on the Selector object with your CSS selector expression as an argument.

# Use CSS on Selector Object
sel.css('p')
[<Selector xpath='descendant-or-self::p' data='<p id="first-paragraph" class="paragr...'>,
 <Selector xpath='descendant-or-self::p' data='<p class="paragraph">Paragraph of <st...'>,
 <Selector xpath='descendant-or-self::p' data='<p class="paragraph">Nested paragraph...'>]

How to Apply Multiple Methods with Scrapy Selector

To apply multiple methods to a Scrapy Selector, chain methods on the selector object.

# Chaining on Selector Object
sel.xpath('//div').css('p')
[<Selector xpath='descendant-or-self::p' data='<p class="paragraph">Nested paragraph...'>]

Extract Data From Selector with Extract()

Use the extract() method to access data within the Selectors in the SelectorList.

sel.xpath('//p').extract()

Output:

['<p id="first-paragraph" class="paragraph">Paragraph of text</p>', '<p class="paragraph">Paragraph of <strong>text 2</strong></p>', '<p class="paragraph">Nested paragraph</p>']

Get First Element from Selector with Extract_first()

To get the data of the first item in the SelectorList, use the extract_first() method.

sel.xpath('//p').extract_first()
'<p id="first-paragraph" class="paragraph">Paragraph of text</p>'
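
In recent Scrapy versions, get() and getall() are the preferred equivalents of extract_first() and extract(). Here is a minimal sketch reusing the sel object created above; the outputs shown as comments are what you should expect from the sample HTML.

# get() returns the first match as a string (like extract_first())
sel.xpath('//title/text()').get()
# 'Title of your web page'

# getall() returns a list of all matches (like extract())
sel.xpath('//p/@class').getall()
# ['paragraph', 'paragraph', 'paragraph']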

Web Scraping with Requests and Scrapy

You can use the Python requests library to fetch the HTML of a webpage and then use the Scrapy Selector to parse the HTML with XPath.

Below we will extract all the links on a page with Scrapy and Requests.

from scrapy import Selector
import requests

url = 'https://crawler-test.com/'
response = requests.get(url)
html = response.text

sel = Selector(text=html)

sel.xpath('//a/@href').extract() 
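
As a small follow-up, you could combine the same approach with urljoin() to turn the relative links into absolute URLs, similar to what the spider does later in this post. This is just a sketch using the same crawler-test.com page.

from urllib.parse import urljoin

import requests
from scrapy import Selector

url = 'https://crawler-test.com/'
response = requests.get(url)
sel = Selector(text=response.text)

# Resolve every href against the page URL so relative links become absolute
absolute_links = [urljoin(url, href) for href in sel.xpath('//a/@href').extract()]
print(absolute_links[:5])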

How to Make a Scrapy Web Crawler

To make a Scrapy web crawler, create a class that inherits from scrapy.Spider, add the start_requests() method to define URLs to crawl and use a parsing method as a callback to process each page.

Create the Scrapy Spider Class

To create the Scrapy spider, create a class that inherits from the scrapy.Spider class and give it a name.

# Import scrapy library
import scrapy

# Create the spider class
class SimpleSpider(scrapy.Spider):
  name = "SimpleSpider"
  # Do something

Add the Start_Requests Method

The start_requests() method is required and takes self as its only argument. It defines which pages to crawl and what to do with each crawled page, and it must be named exactly start_requests().

It loops through each URL and yields a scrapy.Request() object.

Each Request produces a Response object, which Scrapy passes to the callback; the Response exposes the same xpath() and css() selector methods you saw earlier.

Simply put, it takes each URL we give it and sends the response to the parse() method defined as the callback.

# Import scrapy library
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin

# Create the spider class
class SimpleSpider(scrapy.Spider):
    name = "SimpleSpider"

    # start_requests method
    def start_requests(self):
        # Urls to crawl
        urls = ["https://www.crawler-test.com"]
        for url in urls:
            # Make the request and then execute the parse method
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Do something with the response
        pass

Add a Parser to the Scrapy Spider

Add a parser to the Scrapy spider so that it actually does something with the pages it crawls.

Below, the parse() method receives the response object from the scrapy.Request call. The name of this method has to be the same as the one given to the callback parameter of the start_requests() method.

In the parse() method below, we are extracting all links using a CSS selector specific to scrapy: a::attr(href).

Then, we use urljoin() to turn relative URLs into absolute ones.

Finally, we write each URL to a file.

# Import scrapy library
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin

# Create the spider class
class SimpleSpider(scrapy.Spider):
  name = "SimpleSpider"
  # start_requests method
  def start_requests(self):
    # Urls to crawl
    urls = ["https://www.crawler-test.com"]
    for url in urls:
      # Make the request and then execute the parse method
      yield scrapy.Request(url=url, callback=self.parse)

  # parse method
  def parse(self, response):
    # Extract Href from all links
    links = response.css('a::attr(href)').extract()
    # Create text file to add links to
    with open('links.txt', 'w') as f:
      # loop each link
      for link in links:
        # Join absolute and Relative URLs
        link = urljoin("https://www.crawler-test.com",link)
        # Write link to file
        f.write(link+'\n')

Run the Spider

Run the spider using the CrawlerProcess class from the scrapy.crawler module.

# Run the Spider
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(SimpleSpider)
process.start()
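
CrawlerProcess also accepts a settings dictionary, which is handy for quick experiments when you are not working inside a full Scrapy project. The values below are only examples, not required settings.

from scrapy.crawler import CrawlerProcess

# Pass settings directly instead of relying on a settings.py file
process = CrawlerProcess(settings={
    'USER_AGENT': 'webcrawlerbot (+http://www.yourdomain.com)',  # identify your bot
    'LOG_LEVEL': 'WARNING',   # keep the console output quiet
    'ROBOTSTXT_OBEY': True,   # respect robots.txt
})
process.crawl(SimpleSpider)
process.start()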

Full Code for the Simple Spider

Here is the full code for the steps discussed above.

# Import scrapy library
import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin

# Create the spider class
class SimpleSpider(scrapy.Spider):
  name = "SimpleSpider"

  # start_requests method
  def start_requests(self):
    # Urls to crawl
    urls = ["https://www.crawler-test.com"]
    for url in urls:
      # Make the request and then execute the parse method
      yield scrapy.Request(url=url, callback=self.parse)

  # parse method
  def parse(self, response):
    # Extract Href from all links
    links = response.css('a::attr(href)').extract()
    # Create text file to add links to
    with open('links.txt', 'w') as f:
      # loop each link
      for link in links:
        # Join absolute and Relative URLs
        link = urljoin("https://www.crawler-test.com",link)
        # Write link to file
        f.write(link+'\n')

# Run the Spider
process = CrawlerProcess()
process.crawl(SimpleSpider)
process.start()

We will now go one step further and build a web crawler using the Scrapy shell.

How to Build a Web Crawler with the Scrapy Shell

This next section will show you how to use the Scrapy shell to crawl a website.

Scrapy lets you fetch a URL and test the server response using the scrapy shell in the terminal. I recommend that you start by testing the website you want to crawl, to check whether there is some kind of problem.

$ scrapy shell

To get help using the shell, use shelp().

To exit the Scrapy shell, type exit() (or press CTRL + D on macOS/Linux, CTRL + Z on Windows).

Then, fetch the URL to inspect the server response. Note that fetch() runs inside the Scrapy shell, not in the terminal.

fetch('https://ca.indeed.com')

If everything works, the server returns a 200 status code.

Analyze Response

To analyze the server response, here are a few useful functions.

Read the response

view(response)

Scrapy will open the page for you in a new browser window.

Get Status Code

response.status

View the raw HTML of the page

print(response.text)

Get the Title of the page using Xpath

response.xpath('//title/text()').get()

Get H1s of the page using Xpath

response.xpath('//h1/text()').getall()

Indeed has no H1 on the homepage. I guess that says a lot about the importance of the H1 (or of their homepage).

Extract data using CSS Selectors

Here, I’d like to know which keywords are the most important for Indeed.

response.css('.jobsearch-PopularSearchesPillBoxes-pillBoxText::text').extract()   

Print Response Header

from pprint import pprint
pprint(response.headers)

Start a Project in Scrapy

We will now create a new project. First, exit the Shell by typing exit().

In the terminal, add the following command.

$ scrapy startproject indeed

We call Scrapy using the scrapy command. startproject initializes a new directory with the name of the project you give it, in our case indeed. Files like __init__.py are added by default to the newly created crawler directory.

Start a project with Scrapy

Understand Default Files Created

One folder and four files are created here. The spiders folder will contain your spiders as you create them. The files created are items.py, pipelines.py, settings.py, and middlewares.py.

  • items.py contains the elements that you want to scrape from a page: url, title, meta, etc. (a minimal sketch follows below)
  • pipelines.py contains the code that tells Scrapy what to do with the scraped data: cleaning HTML data, validating scraped data, dropping duplicates, storing the scraped item in a database.
  • settings.py contains the settings of the crawler, such as the crawl delay.
  • middlewares.py is where you can set up proxies when crawling a website.

scrapy.cfg is the configuration file and __init__.py is the initialization file.
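
As an illustration, here is a minimal sketch of what items.py could look like for this project. The PageItem class and its fields are just examples, not part of the generated template.

# items.py (sketch) - declare the fields you plan to scrape
import scrapy

class PageItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    h1 = scrapy.Field()
    description = scrapy.Field()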

Create Your First Crawler

You can now create your first crawler by accessing the newly created directory and running genspider.

Create the Template File

In the terminal, access the directory.

cd indeed

Create a template file

scrapy genspider indeedCA ca.indeed.com

This will create a spider called ‘indeedCA’ using a ‘basic’ template from Scrapy. A file named indeedCA.py is now available in the spiders folder and contains this basic information:

# -*- coding: utf-8 -*-
import scrapy # import

class IndeedcaSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['ca.indeed.com']
    start_urls = ['http://ca.indeed.com/']

    def parse(self, response):
        pass

This code imports scrapy and then creates a Python class that contains the code for the spider.

The Scrapy spider class is in the following format:

class SpiderName(scrapy.Spider):
    name = 'SpiderName'
    # Do something

This class inherits the methods from scrapy.Spider and tells Scrapy which pages to scrape and how to scrape them.

I will modify it a little to crawl for Python Jobs.

Extract Data From the Page

The parse() method of the class is used to extract data from the HTML document. In this tutorial, I will extract the H1, the title, and the meta description.

# -*- coding: utf-8 -*-
import scrapy


class IndeedcaSpider(scrapy.Spider):
    name = 'indeedCA'
    allowed_domains = ['ca.indeed.com']
    start_urls = ['https://ca.indeed.com/Python-jobs']

    def parse(self, response):

        print("Fetching ... " + response.url)

        #Extract data
        h1 = response.xpath('//h1/text()').extract()
        title = response.xpath('//title/text()').extract()
        description = response.xpath('//meta[@name="description"]/@content').extract()
        
        # Return in a combined list
        data = zip(h1,title,description)

        for value in data:
            # create a dictionary to store the scraped data
            scraped_data = {
                #key:value
                'url' : response.url,
                'h1' : value[0], 
                'title' : value[1],
                'description' : value[2]
            }

            yield scraped_data

Export Your Data

To export your data to a CSV file, go to the settings.py file and add these lines.

FEED_FORMAT = 'csv'
FEED_URI = 'indeed.csv'

Or you can save it to JSON.

FEED_FORMAT = 'json'
FEED_URI = 'indeed.json'
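
Note that in more recent Scrapy versions (2.1+), FEED_FORMAT and FEED_URI are deprecated in favour of the FEEDS setting, which maps each output file to its options. A minimal sketch of the equivalent configuration:

# settings.py - newer-style feed export configuration
FEEDS = {
    'indeed.csv': {'format': 'csv'},
    # or export to JSON instead
    # 'indeed.json': {'format': 'json'},
}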

Run the Spider

Running the spider is super simple. Just run this line in the terminal.

$ scrapy crawl indeedCA

This should return an indeed.csv file with the data extracted from https://ca.indeed.com/Python-jobs.

Make Your Web Crawler Respectful

Don’t Crawl Too Fast

Scrapy web crawlers are really fast. They fetch a large number of pages in a short amount of time, which puts a heavy load on servers.

If a web server can handle 5 requests per second and your crawler hits it with 100 requests per second, or 100 concurrent requests (threads), that comes at a cost for the site owner.

Crawl Delay

To avoid hitting the web servers too frequently, use the DOWNLOAD_DELAY setting in your settings.py file.

DOWNLOAD_DELAY = 3

This will add a 3-second delay between requests.

Concurrent Requests (threads)

Most servers can handle 4-5 concurrent requests. If crawling speed is not an issue for you, consider capping the number of concurrent requests per domain, or even reducing it to a single thread.

CONCURRENT_REQUESTS_PER_DOMAIN = 4

AutoThrottle

Each website varies in the number of requests it can handle. AutoThrottle solves this: it automatically adjusts the delay between requests based on the latency of the web server's responses.

AUTOTHROTTLE_ENABLED = True
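
If you want finer control, AutoThrottle comes with a few related settings. The values below are the documented defaults, shown only as a starting point to tweak.

# settings.py - optional AutoThrottle tuning
AUTOTHROTTLE_START_DELAY = 5           # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 60            # maximum delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per server
AUTOTHROTTLE_DEBUG = False             # set to True to log throttling stats for every response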

Identify Yourself

Whenever you hit a server, you leave a trace. Website owners can identify you by your IP address, but also by your user agent.

By identifying yourself as a bot, you help website owners understand your intentions.

USER_AGENT = 'webcrawlerbot (+http://www.yourdomain.com)'

Robots.txt

Websites create a robots.txt file to make sure that some pages are not crawled by bots. You could decide not to follow its directives; however, you should, both ethically and legally.

ROBOTSTXT_OBEY = True

Use Cache

By building your own web crawler, you will try many things. Sometimes, you will request the same page over and over just to fine-tune your scraper.

Scrapy has a built-in HTTP cache. This way, when you fetch the page the 2nd, 3rd or 99th time within the expiration window, it uses the cached version of the page instead of sending a new request.

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 500

Conclusion

Scrapy is the best and most powerful open-source Python framework for building a web crawler. We have just scratched the surface, and it may still seem confusing to use.

If this is the case, I suggest that you start with something simpler and learn how to use requests and BeautifulSoup together for web scraping.
