Web Scraping With Python and Requests-HTML (with Example)

The Python Requests-HTML library is a web scraping module that combines HTTP requests with HTML parsing and JavaScript rendering.

In this web scraping tutorial, we will learn how to scrape a website with Python using the Requests-HTML library.

We will extract basic information from a website:

    • Title
    • H1
    • Links
    • Author
    • Canonical
    • Hreflang
    • Meta Robots Tag
    • Broken Links

    You can find the full code in my Python Notebook.

    What Is the Requests-HTML Library?

    The Requests-HTML library is an HTML parser that lets you use CSS selectors and XPath expressions to extract the information that you want from a web page. It can also render JavaScript.

    What is Web Scraping?

    Web scraping is the process of parsing the content of a web page to extract specific information.

    Parsing means analyzing a document to describe its syntax (i.e. the HTML structure). Without a parser, an HTML document looks like a single block of text.

    When you scrape a website, you ask the server to send you an HTML document, which you parse to understand its building blocks (<head>, <body>, <title>, <h1>, etc.). Once the structure is understood, you can pull out any information that you want.

    How is Requests-HTML Useful in Web Scraping?

    Requests-HTML is useful in web scraping when a web page requires JavaScript to be executed. It lets you retrieve content that may not be available through simple HTTP requests.

    Install and Load the Libraries

    In this tutorial, we will use the requests library to “call” the URL by making HTTP requests to the server, the requests-html library to parse the data, and the pandas library to work with the scraped information. The urlparse function used later comes from Python's built-in urllib.parse module, so it doesn't need to be installed.

    pip install requests requests-html pandas
    

    Fetch URL With HTMLSession().get()

    To scrape a web page in Python with the requests-html library, use the HTMLSession() class to initialize a session object. Then, perform a GET request using the .get() method.

    You may get the following error if you are using a Jupyter Notebook; in that case, use AsyncHTMLSession.

    RuntimeError: Cannot use HTMLSession within an existing event loop.
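
    In a notebook, a minimal sketch of the AsyncHTMLSession alternative could look like this (the URL here is a placeholder):

    from requests_html import AsyncHTMLSession
    
    asession = AsyncHTMLSession()
    # Inside a Jupyter cell, you can await the coroutine directly
    response = await asession.get('https://example.com/')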

    Here, I will make an example with Hamlet Batista’s amazing intro to Python post.

    To make sure the code fails gracefully, I will add a try/except statement to print an error in case the request doesn't work.

    We will store the response in a variable called response.

    import requests
    from requests_html import HTMLSession
    
    url = "https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/"
    
    try:
        session = HTMLSession()
        response = session.get(url)
        
    except requests.exceptions.RequestException as e:
        print(e)
    

    Structure of Scraping Functions

    The structure of the requests-HTML parsing call goes like this:

    variable.attribute.function(*selector*, parameters)

    The variable is the response object that was returned by the session.get(url) call.

    The attribute is the type of content that you want to extract (html / lxml).
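
    For example, applying this pattern to the response object created above:

    # variable = response, attribute = html, function = find('h1')
    h1 = response.html.find('h1', first=True)
    print(h1.text)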

    The requests-HTML parser also has many useful built-in methods for SEOs.

    • links: Get all links found on a page, as they appear in the HTML (relative URLs included);
    • absolute_links: Get all links found on a page, resolved to absolute URLs (see the example after this list);
    • find(): Find a specific element on a page with a CSS Selector;
    • xpath(): Get elements using an XPath expression;
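
    For example, here is how links and absolute_links compare on the response object created earlier:

    # Links as they appear in the HTML (may be relative)
    print(response.html.links)
    
    # The same links resolved to absolute URLs
    print(response.html.absolute_links)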

    Extract the Title From the Page

    Here, we are going to use find() on the html attribute to “find” the <title> tag with the 'title' CSS Selector, which returns a list of elements ([<Element 'title' >]).

    title =  response.html.find('title')
    print(title)
    # [<Element 'title' >]
    

    To print the actual title, we need to use the index with the text attribute.

    print(title[0].text)
    # An Introduction to Python for SEO Pros Using Spreadsheets
    

    This is the same as using the first=True parameter to do it in a one-liner.

    title =  response.html.find('title', first=True).text
    print(title)
    # An Introduction to Python for SEO Pros Using Spreadsheets
    

    Extract Meta Description

    To extract the meta description from a page, we will use the xpath() function with the //meta[@name="description"]/@content XPath expression.

    meta_desc =  response.html.xpath('//meta[@name="description"]/@content')
    print(meta_desc)
    # ["Learn Python basics while studying code John Mueller recently shared that populates Google Sheets. We'll also modify his code to add a simple visualization."]
    

    Extract All Links From a Webpage

    The absolute_links attribute lets us extract all the links of a page, resolved to absolute URLs.

    links = response.html.absolute_links
    print(links)
    

    Extract Information Using Class or ID

    You can extract any specific information from a page using the dot (.) notation to select a class, or the pound (#) notation to select the ID.

    Here we are going to extract the author’s name using the class.

    author = response.html.find('.post-author', first=True).text
    print(author)
    # Hamlet Batista
    

    Extract Canonical Link

    canonical = response.html.xpath("//link[@rel='canonical']/@href")
    print(canonical)
    # ['https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/']
    

    Extract Hreflang

    hreflang = response.html.xpath("//link[@rel='alternate']/@hreflang")
    print(hreflang)
    

    Extract Meta Robots

    meta_robots = response.html.xpath("//meta[@name='ROBOTS']/@content")
    print(meta_robots)
    # ['NOODP']
    

    Extract Nested Information

    To extract information from a specific location, you can dig down into the DOM using CSS selectors.

    get_nav_links = response.html.find('a.sub-m-cat span')
    

    We will build a for loop to loop through the elements of get_nav_links and append each element's text to a list called nav_links.

    nav_links = []
    
    for link in get_nav_links:
        nav_links.append(link.text)
        
    nav_links
    
    # ['SEO', 'PPC', 'CONTENT', 'SOCIAL', 'NEWS', 'ADVERTISE', 'MORE']
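
    The same result can be written more concisely with a list comprehension:

    nav_links = [link.text for link in get_nav_links]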
    

    Save a Subsection of a Page in a Variable

    If the content that you want to extract is always in the same <div>, you can save that element in a variable so that you can reuse it.

    Here, I will extract links that are in the actual content of a post by “saving” the post-342779 article in a variable called article.

    article = response.html.find('article.cis_post_item_initial.post-342779', first=True)
    article_links = article.xpath('//a/@href')
    

    Case Study: Extract Broken Links

    import re
    import requests
    from requests_html import HTMLSession
    from urllib.parse import urlparse
    
    # Get Domain Name With urlparse
    url = "https://www.jobillico.com/fr/partenaires-corporatifs"
    parsed_url = urlparse(url)
    domain = parsed_url.scheme + "://" + parsed_url.netloc
    
    # Get URL 
    session = HTMLSession()
    r = session.get(url)
    
    # Extract Links
    jlinks = r.html.xpath('//a/@href')
    
    # Remove bad links and replace relative paths with absolute paths
    updated_links = []
    
    for link in jlinks:
        # Skip emails, javascript: and tel: pseudo-links
        if re.search(r".*@.*|.*javascript:.*|.*tel:.*", link):
            continue
        # Prefix relative paths with the domain
        elif re.search(r"^(?!http)", link):
            updated_links.append(domain + link)
        else:
            updated_links.append(link)
    

    Now, it is time to extract broken links and add them to a list.

    broken_links = []
    
    for link in updated_links:
        print(link)
        try:
            # Request each link once and record non-200 responses
            status_code = requests.get(link, timeout=10).status_code
            if status_code != 200:
                broken_links.append(link)
        except requests.exceptions.RequestException as e:
            print(e)
    
    broken_links
    

    Scraping JavaScript With Requests HTML and AsyncHTMLSession

    To execute and scrape JavaScript with Requests-HTML, call the .render() method on the response's html attribute if you are not using Jupyter, or the .arender() method with an AsyncHTMLSession if you are. Then, use the find() method to extract the data.
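
    Outside of Jupyter, a minimal sketch with HTMLSession and render() could look like this (the first call to render() downloads Chromium):

    from requests_html import HTMLSession
    
    url = 'https://crawler-test.com/javascript/dynamically-inserted-text'
    
    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # executes the JavaScript on the page
    h1 = r.html.find('h1', first=True)
    print(h1.text)
    

    In a Jupyter Notebook, use AsyncHTMLSession with the arender() method instead: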

    from requests_html import AsyncHTMLSession
    
    url = 'https://crawler-test.com/javascript/dynamically-inserted-text'
    
    session = AsyncHTMLSession()
    
    async def get_html(url):
        r = await session.get(url)
        await r.html.arender()
        h1 = r.html.find('h1', first=True)
        print(h1.text)
    
    await get_html(url)
    

    Parse Requests-HTML Response with BeautifulSoup

    While Requests-HTML provides what you need to parse HTML, you may want to parse the HTML using BeautifulSoup instead.

    To parse the HTML of the Requests-HTML object with BeautifulSoup, pass the response.html.raw_html attribute to the BeautifulSoup object.

    # requests-html beautifulsoup
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession
    
    url = 'https://crawler-test.com/'
    session = HTMLSession()
    r = session.get(url)
    soup = BeautifulSoup(r.html.raw_html, features='lxml')
    soup.find('h1')
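
    From there, any BeautifulSoup method is available. For example, to extract the heading text:

    print(soup.find('h1').get_text())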
    

    HTMLSession() Methods

    • close(): If a browser was created, close it first.
    • delete(): Sends a DELETE request.
    • get(): Sends a GET request.
    • head(): Sends a HEAD request.
    • mount(): Registers a connection adapter to a prefix.
    • options(): Sends an OPTIONS request.
    • patch(): Sends a PATCH request.
    • post(): Sends a POST request.
    • put(): Sends a PUT request.
    • request(): Constructs a Request, prepares it, and sends it. Returns a Response object.
    • send(): Sends a given PreparedRequest.
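
    These methods mirror those of a regular requests Session. For instance, a minimal sketch of a POST request (the endpoint and payload here are hypothetical):

    from requests_html import HTMLSession
    
    session = HTMLSession()
    # Hypothetical endpoint and form payload, for illustration only
    r = session.post('https://example.com/search', data={'q': 'python'})
    print(r.status_code)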

    Alternative to Requests-HTML

    There are many Python libraries that can be used for web scraping.

    • Python Requests: Performing HTTP requests
    • Python BeautifulSoup: Parsing Library
    • Python Selenium: Browser-based web scraping library
    • Python Scrapy: Web scraping framework that includes recursive crawling capabilities

    Automate Your Web Scraping Script

    To automate your web scraping, schedule your Python script with the Windows Task Scheduler, or with a cron job on Mac or Linux.

    This is the end of this Python tutorial on web scraping with the Requests-HTML library.
