This tutorial is for you if you want to crawl a website with Screaming Frog, extract more URLs from the crawled pages, and then recrawl those extracted URLs.
When you use custom extraction, you end up with a list of new URLs spread across multiple columns, in a format that is not ready to be sent back to the crawler.
What you need to do is convert the extraction columns into a database format that you can send back to Screaming Frog.
It should be a built-in function, right?
It is simple to copy them manually if you have crawled 10 pages, but not if you have crawled a hundred thousand pages.
However, it is really simple to solve this using Python.
If you don’t know how to use Python, I have an entire Guide dedicated to Python for SEO.
Convert Your Extracted URLs to a Database Using Pandas
First, you need to export your crawl, including the custom extraction columns, from Screaming Frog as an Excel file.
Then, we will convert your extracted URLs to a database format using Pandas.
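To make the idea concrete, here is a minimal sketch of what pd.melt does with a tiny, made-up extraction export (the column names 'Extraction 1' and 'Extraction 2' are just placeholders for illustration). It stacks the extraction columns into a single column of URLs, one row per extracted URL.

import pandas as pd

# A tiny, made-up crawl export with two extraction columns
toy = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b'],
    'Extraction 1': ['https://example.com/1', 'https://example.com/2'],
    'Extraction 2': ['https://example.com/3', None],
})

# Stack the extraction columns into one long column of URLs
toy_long = pd.melt(toy, id_vars='Address',
                   value_vars=['Extraction 1', 'Extraction 2'],
                   var_name='extractedUrls', value_name='Extraction').dropna()

print(toy_long)

The full script below applies the same idea to the actual Screaming Frog export, with a few extra steps to remove duplicates and the status columns.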
import pandas as pd

# Read the Screaming Frog export that contains the custom extraction columns
crawl = pd.read_excel(r'C:\Users\j-c.chouinard\Python\Screaming Frog\burnabycrawl.xlsx')

# Transpose the DataFrame so that each crawled page becomes a column
crawl_transposed = crawl.transpose()

# Remove duplicate extracted URLs within each page (each column of the transposed frame)
for col in crawl_transposed.columns:
    crawl_transposed[col] = crawl_transposed[col].drop_duplicates()

# Bring back the original row/column order
crawl_dedup = crawl_transposed.transpose()

# Remove the status columns (Status Code and Status)
crawl_drop = crawl_dedup.drop(crawl_dedup.columns[1:3], axis="columns")

# Melt the extraction columns into a database (long) format
crawl_db = pd.melt(crawl_drop, id_vars='Address', value_vars=crawl_drop.columns[1:],
                   var_name='extractedUrls', value_name='Extraction').dropna()

# Write to Excel
crawl_db.to_excel("urls_to_recrawl.xlsx", index=False, header=False)
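The script above writes the whole melted table to Excel. If all you need is a plain, deduplicated list of URLs to paste into Screaming Frog's list mode, you can also write out just the Extraction column on its own (a minimal sketch, assuming the script above has run; the file name is just an example):

# Keep only the extracted URLs, without duplicates
urls_only = crawl_db['Extraction'].drop_duplicates()

# Write a single, header-less column of URLs for list mode
urls_only.to_excel("urls_to_recrawl_list.xlsx", index=False, header=False)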
Other Technical SEO Guides With Python
- Find Rendering Problems On Large Scale Using Python + Screaming Frog
- Recrawl URLs Extracted with Screaming Frog (using Python)
- Find Keyword Cannibalization Using Google Search Console and Python
- Get BERT Score for SEO
- Web Scraping With Python and Requests-HTML
- Randomize User-Agent With Python and BeautifulSoup
- Create a Simple XML Sitemap With Python
You now have a file that you can use to recrawl the URLs extracted from your previous crawl.