Randomize User-Agent With Python and BeautifulSoup (by JR Oakes)

This post is part of the complete Guide on Python for SEO

Pandas is the most useful library for me because it makes it easy to manipulate data tables with millions of rows and to import and export CSVs. It even connects with BigQuery to dump data to the cloud.

This post was written from a Q&A with JR Oakes to help people start with Python for SEO.

Below are a few functions that I use in many projects. GET_UA randomizes the User-Agent string to get around servers that throw errors if you try to crawl with the default User-Agent. parse_url returns the content, the BeautifulSoup-parsed DOM, and the content type for a URL. parse_internal_links(soup, current_page) shows how you can use the parsed DOM from parse_url to grab the internal links for a web page.

import random
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse


def GET_UA():
    # Line continuations are implicit inside brackets, so no trailing backslashes are needed
    uastrings = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.25 (KHTML, like Gecko) Version/8.0 Safari/600.1.25",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/600.1.17 (KHTML, like Gecko) Version/7.1 Safari/537.85.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36",
    ]

    return random.choice(uastrings)


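def parse_url(url):
    """Fetch url and return (content, soup, content_type)."""
    headers = {'User-Agent': GET_UA()}
    # Initialize every return value so a failed request cannot raise NameError at the return
    content, soup, ct = None, None, None

    try:
        response = requests.get(url, headers=headers)
        # .get() avoids a KeyError when the server omits the Content-Type header
        ct = response.headers.get('Content-Type', '').lower().strip()
        content = response.content

        # Only HTML responses are parsed; other content types are returned as raw bytes
        if 'text/html' in ct:
            soup = BeautifulSoup(content, "lxml")

    except Exception as e:
        print("Error:", str(e))

    return content, soup, ct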


def parse_internal_links(soup, current_page):
    return [a['href'].lower().strip() for a in soup.find_all('a', href=True) if urlparse(a['href']).netloc == urlparse(current_page).netloc]
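One thing to watch with the host comparison above: a relative href such as "/about" has an empty netloc, so it never matches the current page's host and is dropped, and calling .lower() on the whole href also lowercases the path, which some servers treat as case-sensitive. Below is a minimal sketch of one possible variant, assuming you want relative links resolved against the current page. The helper name resolve_internal_links and the example URLs are hypothetical and not part of the original functions; it takes a plain list of href strings so it runs with the standard library alone.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal_links(hrefs, current_page):
    """Return absolute URLs from hrefs that share current_page's host.

    Hypothetical helper for illustration; not part of the original post.
    """
    base_host = urlparse(current_page).netloc.lower()
    internal = []
    for href in hrefs:
        # urljoin turns relative hrefs such as "/about" into absolute URLs
        absolute = urljoin(current_page, href.strip())
        parts = urlparse(absolute)
        # Skip mailto:, javascript:, tel: and other non-HTTP schemes
        if parts.scheme not in ("http", "https"):
            continue
        # Hostnames are case-insensitive; the path is left untouched
        if parts.netloc.lower() == base_host:
            internal.append(absolute)
    return internal


# Example hrefs (made up for this sketch)
hrefs = [
    "/about",
    "contact.html",
    "https://example.com/Blog",
    "https://other.example.org/",
    "mailto:hi@example.com",
]
links = resolve_internal_links(hrefs, "https://example.com/docs/index.html")
print(links)
```

With a parsed page, the hrefs list could be built as [a['href'] for a in soup.find_all('a', href=True)] and passed in together with the page URL.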
