Random User-Agent With Python and BeautifulSoup (by JR Oakes)

This post is part of the complete Guide on Python for SEO

In this post, JR Oakes shows us how to use random user-agents with Python and BeautifulSoup.

It was written from a Q&A with JR Oakes to help people get started with Python for SEO.

Pandas is the most useful library for me because it makes it easy to manipulate data tables with millions of rows and to import and export them as CSVs. It even connects with BigQuery to push data to the cloud.
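As a quick, hedged illustration of that workflow (the file names, the status_code column, and the BigQuery table below are placeholders, not from JR's code):

    import pandas as pd

    # Load a large CSV into a DataFrame and filter it in memory.
    df = pd.read_csv("crawl_data.csv")            # hypothetical export of crawl results
    errors = df[df["status_code"] >= 400]         # assumes a status_code column exists
    errors.to_csv("crawl_errors.csv", index=False)

    # Pushing the table to BigQuery requires the pandas-gbq package;
    # the dataset, table, and project names here are placeholders.
    # errors.to_gbq("seo_dataset.crawl_errors", project_id="my-gcp-project")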

Below are a few functions that I use in many projects. GET_UA() randomizes the User-Agent string to get around servers that throw errors when you crawl with the default User-Agent. parse_url() returns the content, the BeautifulSoup-parsed DOM, and the content type for a given URL. parse_internal_links(soup, current_page) shows how to use those two together to grab the internal links of a web page; a short usage example follows the code.


    import random
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse
    
    
    def GET_UA():
        # Pick a random desktop-browser User-Agent string so requests
        # don't all announce themselves with the library default.
        uastrings = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.1.17 (KHTML, like Gecko) Version/7.1 Safari/537.85.10",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
            "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36",
        ]

        return random.choice(uastrings)
    
    
    def parse_url(url):
        # Fetch a URL with a random User-Agent and return the raw content,
        # the BeautifulSoup-parsed DOM (for HTML pages), and the content type.
        headers = {'User-Agent': GET_UA()}
        content = None
        soup = None
        ct = None

        try:
            response = requests.get(url, headers=headers, timeout=10)
            ct = response.headers.get('Content-Type', '').lower().strip()
            content = response.content

            # Only parse HTML responses; skip images, PDFs, etc.
            if 'text/html' in ct:
                soup = BeautifulSoup(content, "lxml")

        except Exception as e:
            print("Error:", str(e))

        return content, soup, ct
    
    
    def parse_internal_links(soup, current_page):
        # Return the hrefs of links that point to the same host as current_page.
        # Note: relative hrefs have an empty netloc, so resolve them with
        # urljoin(current_page, href) first if you want to capture them too.
        host = urlparse(current_page).netloc
        return [a['href'].lower().strip()
                for a in soup.find_all('a', href=True)
                if urlparse(a['href']).netloc == host]
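
To tie the three functions together, here is a minimal usage sketch; the start URL is a placeholder, and the snippet assumes the page responds with HTML.

    # Fetch a page and print its internal links.
    # The URL below is a placeholder; point it at a site you are allowed to crawl.
    url = "https://www.example.com/"
    content, soup, ct = parse_url(url)

    if soup is not None:
        for link in parse_internal_links(soup, url):
            print(link)
    else:
        print("Nothing to parse; content type was:", ct)

Returning soup as None for non-HTML responses lets the caller skip parsing binary content safely.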
    
    


This is it. Special thanks to JR Oakes for sharing this piece of code to add random user-agents to HTTP requests.
