Recrawl URLs Extracted with Screaming Frog (using Python)

This post is part of the complete Guide on Python for SEO

This tutorial is for you if you want to crawl a website with Screaming Frog and extract more URLs from those pages to be recrawled.

By using custom extraction, you will end up with a list of new URLs spread across columns, not yet ready to be sent to the crawler.

What you need to do is convert the extraction columns into a database-style format, one URL per row, that you can send back to Screaming Frog.
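
To make the problem concrete, here is a minimal sketch of the wide-to-long conversion we are after. The column names ("Address", "Extraction 1", "Extraction 2") and the URLs are assumptions for illustration; yours will depend on how you named your extractors in Screaming Frog.

import pandas as pd

# Hypothetical custom extraction export: one row per crawled page,
# with the extracted URLs spread across columns.
crawl = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'Extraction 1': ['https://example.com/1', 'https://example.com/3'],
    'Extraction 2': ['https://example.com/2', None],
})

# Target "database" format: one extracted URL per row.
long_format = pd.melt(crawl, id_vars='Address', value_name='Extraction').dropna()
print(long_format)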

It should be a built-in function, right?

But no.

It is simple to copy them manually if you have crawled 10 pages, but not if you have crawled a hundred thousand pages.

However, it is really simple to solve this using Python.

If you don’t know how to use Python, I have an entire Guide dedicated to Python for SEO.

Convert Your Extracted URLs to a Database Using Pandas

First, you need to export your crawl.

Then, we will convert your extracted URLs to a database format using Pandas.

import pandas as pd
crawl = pd.read_excel(r'C:\Users\j-c.chouinard\Python\Screaming Frog\burnabycrawl.xlsx')

#Transpose DataFrame
crawl_transposed = crawl.transpose()

#Remove duplicate URLs within each row (each column of the transposed DataFrame is one page)
for col in crawl_transposed.columns:
    # Assigning the deduplicated Series back aligns on the index,
    # so duplicate positions become NaN instead of shifting values up
    crawl_transposed[col] = crawl_transposed[col].drop_duplicates()

#Bring back to the original row/column order
crawl_dedup=crawl_transposed.transpose()

#Remove the status columns
crawl_drop=crawl_dedup.drop(crawl_dedup.columns[1:3],axis="columns")

#Melt into a database format: one extracted URL per row
crawl_db=pd.melt(crawl_drop, id_vars='Address', value_vars=crawl_drop.columns[1:], var_name='extractedUrls', value_name='Extraction').dropna()

#Write to Excel
crawl_db.to_excel("urls_to_recrawl.xlsx",index=False,header=False)
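
If all you need is the bare list of URLs for Screaming Frog's list mode, a variation (an assumption about your workflow, not part of the original script) is to export the Extraction column on its own, deduplicated:

#Optional: export only the extracted URLs, one per line
crawl_db['Extraction'].drop_duplicates().to_excel("urls_to_recrawl.xlsx", index=False, header=False)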

You now have a file that you can use to recrawl the URLs extracted from your previous crawl.