In this post, JR Oakes shows us how to use random User-Agents with Python and BeautifulSoup.
This post was written from a Q&A with JR Oakes to help people get started with Python for SEO.
Pandas is the most useful library for me because it makes it easy to manipulate data tables of millions of rows and to import and export them as CSVs. It even connects with BigQuery to push data to the cloud.
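As a minimal sketch of that kind of workflow (the file name, column names, and BigQuery table below are placeholders, not from the post), it might look like this:

import pandas as pd

# Hypothetical crawl export: a CSV with one row per URL
df = pd.read_csv("crawl_export.csv")

# Typical manipulations: filter rows, derive a column, aggregate
df = df[df["status_code"] == 200]
df["path"] = df["url"].str.replace(r"https?://[^/]+", "", regex=True)
summary = df.groupby("path").size().reset_index(name="count")

# Export the result back to CSV
summary.to_csv("crawl_summary.csv", index=False)

# Optionally push to BigQuery (requires the pandas-gbq package)
# summary.to_gbq("dataset.crawl_summary", project_id="my-project")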
Below are a few functions that I use in many projects. GET_UA() randomizes the User-Agent string to get around servers that throw errors if you try to crawl with the default User-Agent. parse_url(url) returns the content, the BeautifulSoup-parsed DOM, and the content type for a URL. parse_internal_links(soup, current_page) shows how you can use those two to grab the internal links for a web page.
import random
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def GET_UA():
    uastrings = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.1.17 (KHTML, like Gecko) Version/7.1 Safari/537.85.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36",
    ]
    # Pick one User-Agent string at random for each request
    return random.choice(uastrings)

def parse_url(url):
    headers = {'User-Agent': GET_UA()}
    # Initialize the return values so they exist even if the request fails
    content, soup, ct = None, None, None
    try:
        response = requests.get(url, headers=headers)
        ct = response.headers['Content-Type'].lower().strip()
        content = response.content
        if 'text/html' in ct:
            # Only parse the DOM when the response is HTML
            soup = BeautifulSoup(content, "lxml")
    except Exception as e:
        print("Error:", str(e))
    return content, soup, ct

def parse_internal_links(soup, current_page):
    # Keep only links whose host matches the current page's host
    return [a['href'].lower().strip() for a in soup.find_all('a', href=True)
            if urlparse(a['href']).netloc == urlparse(current_page).netloc]
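To see how the three functions fit together, here is a quick usage sketch (the URL below is only illustrative):

# Example usage: fetch a page with a random User-Agent, then list its internal links
content, soup, ct = parse_url("https://www.example.com/")
if soup is not None:
    internal_links = parse_internal_links(soup, "https://www.example.com/")
    print(len(internal_links), "internal links found")

Note that parse_internal_links compares hostnames, so it only picks up absolute links that point to the same domain as the current page.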
Other Technical SEO Guides With Python
- Find Rendering Problems On Large Scale Using Python + Screaming Frog
- Recrawl URLs Extracted with Screaming Frog (using Python)
- Find Keyword Cannibalization Using Google Search Console and Python
- Get BERT Score for SEO
- Web Scraping With Python and Requests-HTML
- Randomize User-Agent With Python and BeautifulSoup
- Create a Simple XML Sitemap With Python
- Web Scraping with Scrapy and Python
That's it. Special thanks to JR Oakes for sharing this piece of code to add random User-Agents to HTTP requests.
Senior Director, Technical SEO Research at Locomotive