This post is part of the complete Guide on Python for SEO
In this Python for SEO tutorial, we will learn how to scrape a website with Python using the Requests-HTML library (code examples included).
We will extract basic information from a website:
- Title
- H1
- Links
- Author
- Canonical
- Hreflang
- Meta Robots Tag
- Broken Links
You can find the full code in my Python Notebook.
What is Web Scraping?
Web scraping is the process of parsing the content of a webpage to extract specific information.
Parsing means analyzing a document to describe its syntax (i.e. the HTML structure). Without a parser, your HTML document looks like a single block of text.
When you are scraping a website, you ask the server to send you an HTML document, which you then parse to understand its building blocks (<head>, <body>, <title>, <h1>, etc.). Once the structure is understood, you can pull out any information that you want.
What is the Requests-HTML Library?
The requests-html library is an HTML parser that lets you use CSS Selectors and XPath Selectors to extract the information that you want from a web page.
Install and Load Libraries
In this tutorial, we will use the requests library to “call” the URL by making HTTP requests to the server, the requests-html library to parse the data, and the pandas library to work with the scraped information.
pip install requests
pip install requests-html
pip install pandas

The re and urllib.parse modules used later in this tutorial are part of the Python standard library and don't need to be installed.
Call the URL With requests.get()
Use HTMLSession() to initialize a session and its .get() method to call the URL that you want to scrape.
Here, I will use Hamlet Batista’s amazing introduction to Python post as an example.
Just to make sure that there is no error, I will add a try and except statement to return an error in case the code doesn’t work. We will store the response in a variable called response.
import requests
from requests_html import HTMLSession

url = "https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/"

try:
    session = HTMLSession()
    response = session.get(url)
except requests.exceptions.RequestException as e:
    print(e)
Structure of Scraping Functions
The structure of the requests-html parsing call goes like this:

variable.attribute.function(*selector*, parameters)

The variable is the response instance that you created using the .get(url) call. The attribute is the type of content that you want to extract (html / lxml).

The requests-html parser also has many useful built-in methods for SEOs:
- links: Get all links found on a page, in their original (possibly relative) form;
- absolute_links: Get all links found on a page, converted to absolute URLs;
- find(): Find a specific element on a page with a CSS Selector;
- xpath(): Find elements using an XPath expression.
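For example, using the response object created above, here is a quick illustration of each method (the selectors assume the page has an <h1>):

# All links as they appear in the HTML (may be relative)
print(response.html.links)

# All links converted to absolute URLs
print(response.html.absolute_links)

# First <h1> element, found with a CSS Selector
h1 = response.html.find('h1', first=True)
print(h1.text)

# The same element, found with an XPath expression
print(response.html.xpath('//h1', first=True).text)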
Extract the Title From the Page
Here, we are going to use find() on the html attribute to “find” the <title> tag using the 'title' CSS Selector. It returns a list of elements ([<Element 'title' >]).
title = response.html.find('title')
print(title)
# [<Element 'title' >]
To print the actual title, we need to index the list and use the text attribute.
print(title[0].text)
# An Introduction to Python for SEO Pros Using Spreadsheets
This is the same as using the first parameter of the find() function in a one-liner.
title = response.html.find('title', first=True).text
print(title)
# An Introduction to Python for SEO Pros Using Spreadsheets
Extract Meta Description
To extract the meta description from a page, we will use the xpath() function with the //meta[@name="description"]/@content XPath expression.
meta_desc = response.html.xpath('//meta[@name="description"]/@content')
print(meta_desc)
# ["Learn Python basics while studying code John Mueller recently shared that populates Google Sheets. We'll also modify his code to add a simple visualization."]
Extract All Links From a Webpage
The absolute_links property lets us extract all the links found on a page as absolute URLs (fragment-only anchors are skipped).
links = response.html.absolute_links
print(links)
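Since we installed pandas to work with the scraped information, here is a minimal sketch (using the links set created above) that turns the links into a DataFrame for further analysis:

import pandas as pd

# Convert the set of absolute links into a one-column DataFrame
df_links = pd.DataFrame(sorted(links), columns=['url'])
print(df_links.head())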
Extract Information Using Class or ID
You can extract any specific information from a page using the dot (.) notation to select a class, or the pound (#) notation to select an ID.
Here we are going to extract the author’s name using the class.
author = response.html.find('.post-author', first=True).text
print(author)
# Hamlet Batista
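The same pattern works with IDs. A hypothetical example (the #sidebar ID below is made up for illustration and is not from the actual page):

# Hypothetical: select an element by its ID using the pound notation
sidebar = response.html.find('#sidebar', first=True)
if sidebar is not None:
    print(sidebar.text)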
Extract Canonical Link
canonical = response.html.xpath("//link[@rel='canonical']/@href")
print(canonical)
# ['https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/']
Extract Hreflang
hreflang = response.html.xpath("//link[@rel='alternate']/@hreflang")
print(hreflang)
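If you also want to know which URL each hreflang annotation points to, you can select the full <link> elements and read their attributes (a sketch; the page may not declare any alternates):

# Pair each hreflang value with its target URL
for alt in response.html.xpath("//link[@rel='alternate'][@hreflang]"):
    print(alt.attrs.get('hreflang'), alt.attrs.get('href'))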
Extract Meta Robots
meta_robots = response.html.xpath("//meta[@name='ROBOTS']/@content")
print(meta_robots)
# ['NOODP']
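Note that XPath matching is case-sensitive, so //meta[@name='ROBOTS'] will miss pages that write name="robots" in lowercase. One way to match the tag regardless of case, using XPath's translate() function:

# Match the meta robots tag whether the name attribute is 'ROBOTS', 'Robots' or 'robots'
meta_robots = response.html.xpath(
    "//meta[translate(@name, 'ROBOTS', 'robots')='robots']/@content"
)
print(meta_robots)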
Extract Nested Information
To extract information from a specific location, you can dig down the DOM using CSS Selectors.
get_nav_links = response.html.find('a.sub-m-cat span')
We will build a for loop to loop through all the indices of the get_nav_links list and add the text of each element to another list called nav_links.
nav_links = []
for i in range(len(get_nav_links)):
    x = get_nav_links[i].text
    nav_links.append(x)

nav_links
# ['SEO', 'PPC', 'CONTENT', 'SOCIAL', 'NEWS', 'ADVERTISE', 'MORE']
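The same result can be written more concisely with a list comprehension:

# Equivalent one-liner
nav_links = [element.text for element in get_nav_links]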
Save a Subsection of a Page in a Variable
If the content that you want to extract is always in a specific <div>, you can save the path in a variable to call it. Here, I will extract the links that are in the actual content of the post by “saving” the post-342779 article in a variable called article.
article = response.html.find('article.cis_post_item_initial.post-342779', first=True)
article_links = article.xpath('//a/@href')
Case Study: Extract Broken Links
import re
import requests
from requests_html import HTMLSession
from urllib.parse import urlparse

url = "https://www.jobillico.com/fr/partenaires-corporatifs"

# Get the domain name with urlparse
parsed_url = urlparse(url)
domain = parsed_url.scheme + "://" + parsed_url.netloc

# Get the URL
session = HTMLSession()
r = session.get(url)

# Extract all links
jlinks = r.html.xpath('//a/@href')

# Skip mailto:, javascript: and tel: links, and convert relative paths to absolute URLs
updated_links = []
for link in jlinks:
    if re.search(".*@.*|.*javascript:.*|.*tel:.*", link):
        continue
    elif re.search("^(?!http).*", link):
        updated_links.append(domain + link)
    else:
        updated_links.append(link)
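Before checking each URL, you may also want to de-duplicate the list so that every link is requested only once (an optional step that is not in the original script):

# Optional: remove duplicates while preserving order
updated_links = list(dict.fromkeys(updated_links))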
Now, it is time to extract broken links and add them to a list.
broken_links = []
for link in updated_links:
    print(link)
    try:
        # Request each URL once and check the status code
        status_code = requests.get(link, timeout=10).status_code
        if status_code != 200:
            broken_links.append(link)
    except requests.exceptions.RequestException as e:
        print(e)

broken_links
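To keep a record of the results, here is a minimal sketch that exports the broken links to a CSV file with pandas (the filename is an arbitrary choice):

import pandas as pd

# Save the broken links to a CSV file for later review
pd.DataFrame({'broken_url': broken_links}).to_csv('broken_links.csv', index=False)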
Automate Your Web Scraping Script
To automate the web scraping, schedule your Python script with Task Scheduler on Windows, or with cron on Mac or Linux.
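For example, on Mac or Linux, a crontab entry like this one (the paths are assumptions; adjust them to your environment) would run the script every day at 6 a.m.:

# m h dom mon dow command
0 6 * * * /usr/bin/python3 /path/to/scraper.py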
Other Technical SEO Guides With Python
- Find Rendering Problems On Large Scale Using Python + Screaming Frog
- Recrawl URLs Extracted with Screaming Frog (using Python)
- Find Keyword Cannibalization Using Google Search Console and Python
- Get BERT Score for SEO
- Web Scraping With Python and Requests-HTML
- Randomize User-Agent With Python and BeautifulSoup
- Create a Simple XML Sitemap With Python
This is the end of this Python tutorial on web scraping with the requests-html library.
Sr SEO Specialist at Seek (Melbourne, Australia). Specialized in technical SEO. On a quest to bring programmatic SEO to large organizations through the use of Python, R and machine learning.