Find Rendering Problems On Large Scale Using Python + Screaming Frog

For the past six months, I’ve been struggling to find a way to detect rendering problems on a large scale.

Fortunately, Tobias Willmann has published a solution to this very problem using Python Pandas and Screaming Frog SEO Spider.

Here, I’ll replicate his solution with a little more detail and further explanation. To learn more about Python for SEO and data science, follow the complete series dedicated to it.

JavaScript SEO Basics

Google handles JavaScript differently from plain HTML: rendering is deferred until Googlebot has the resources available to process the JS content.

When we do JavaScript SEO, we want to make sure that critical content on a page that relies on JavaScript execution is correctly rendered, indexed, and ranked in search results.

How Does Google Handle JavaScript Rendering?

At Google I/O, Google recently revealed its current two-wave process for JS rendering and indexing.

Google Has Two Waves of Indexing

Google basically says this: The rendering of JS is deferred until Googlebot has resources available to process that content.

This means that they first process the HTML, and IF they have more resources, they’ll come back to load your JavaScript content.

Google Rendering Process

Why Test Rendering?

Deferred.

Yes, all your cute little content hidden behind JavaScript will be deferred until Google finds it suitable to come back and spend money on your site to load your JS resources.

This is simple.

If you have important content or important links, make sure that Googlebot can find them straight in your HTML code.
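For a single page, you can spot-check this in seconds. Here is a minimal sketch (the URL and the phrase are placeholders, and it assumes the requests library is installed) that fetches the raw, unrendered HTML, just like a first-wave crawler would, and checks whether your critical content is already in it:

##Fetch the raw HTML without executing any JavaScript
import requests

html = requests.get('https://www.example.com/some-page').text
##Check if a phrase you expect on the page is present in the raw source
print('your critical phrase' in html)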

But what happens when you have thousands of pages and you don’t know which resources are loaded via JavaScript and which are not?

This is what this guide is for.

But first.

Let’s dive into Python Basics.

Python Basics

To follow this guide, you will need to have Python installed and at least a basic knowledge of the language.

If you have no idea how Python works, check out my two guides on the subject: how to install Python with Anaconda and my Python Basics Complete Guide (coming soon).

This guide will be fully explained using Spyder, which is installed by default when you set up Python with Anaconda.
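Before going further, a quick sanity check of your environment can save you some debugging time. Here is a small sketch you can run in the Spyder console to confirm that Python and pandas are available:

##Check your Python and pandas versions
import sys
import pandas as pd

print(sys.version)
print(pd.__version__)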

How to Test JavaScript Rendering on a Large Scale? (Step-By-Step)

Now it is time to put our website’s JavaScript (JS) to the test.

What we want to do is to:

  1. Make two crawls with Screaming Frog, one with “Text Only” rendering and the other with “JavaScript” rendering.
  2. Export the Data to CSV
  3. Load the Crawl Data Using Python
  4. Combine the Crawls Into One Data Frame
  5. Check Differences Between Crawls
  6. Make a Report With Excel

Step #1: Make Two Crawls With Screaming Frog

Let’s make a crawl of our website.

In fact, to compare which pages load properly, let’s make two crawls:

  1. Text Only rendered
  2. JavaScript rendered

Text Only Rendered Crawl

First, let’s crawl our website as Googlebot would in its first wave, before it renders the JS.

Go to Screaming Frog > Configuration > Rendering > Text Only

Text-only Rendering

JavaScript Rendered Crawl

Now, let’s crawl our website including rendered results. This will mimic which links Google will find in its second wave, when it renders the JS content once resources become available.

Go to Screaming Frog > Configuration > Rendering > JavaScript

Careful!

Make sure that you unselect “Enable Rendered Page Screen Shots” if you have a really large site. This will speed the process up. You can always recrawl the problematic URLs later in list mode if you want to see the rendered page screenshots.

JavaScript Rendering

Step #2: Export The Data To CSV

Now that your crawl is complete, you will want to export the data to CSV.

Go to Screaming Frog > Export
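Note that Screaming Frog CSV exports start with a report-title row above the actual column headers, which is why the snippets below pass header=1 to pandas. If you want to see this for yourself, here is a quick sketch (the filename is a placeholder for your own export):

##Peek at the first lines of the export to see the extra title row
with open('Text-only-crawl.csv', encoding='utf-8') as f:
    for _ in range(3):
        print(f.readline().rstrip())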

Step #3: Load the Crawl Data Using Python

Since JS rendering mostly affects SEO through its impact on links and content, we’ll check whether a bot can load the content by comparing the word count and the link information in each crawl. We’ll keep the following columns:

  • Address
  • Status Code
  • Word Count
  • Outlinks
  • Unique Outlinks
  • Inlinks
  • Unique Inlinks
  • Canonical Link Element 1

##Import the pandas library
import pandas as pd

##Load both crawls (header=1 skips Screaming Frog's report-title row)
dfTextonly = pd.read_csv('Text-only-crawl.csv', low_memory=False, header=1)
dfTextonly = dfTextonly[['Address', 'Status Code', 'Word Count', 'Outlinks', 'Unique Outlinks', 'Inlinks', 'Unique Inlinks', 'Canonical Link Element 1']].copy()

dfJS = pd.read_csv('JS-Rendered-crawl.csv', low_memory=False, header=1)
dfJS = dfJS[['Address', 'Status Code', 'Word Count', 'Outlinks', 'Unique Outlinks', 'Inlinks', 'Unique Inlinks', 'Canonical Link Element 1']].copy()

Source: Tobias Willmann
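Before merging, it can be worth confirming that both crawls loaded with the expected columns. This quick check is my addition, not part of the original tutorial:

##Optional sanity check: both dataframes should share the same 8 columns
print(dfTextonly.shape, dfJS.shape)
print(dfTextonly.columns.tolist())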

Step #4: Combine the Crawls Into One Data Frame

This is an easy step: just copy the code below.

df = pd.merge(dfTextonly, dfJS, on='Address', how='outer')

Source: Tobias Willmann

What you’ll get is a new dataframe containing each column name twice.

Pandas has automatically appended the suffixes “_x” and “_y” to the columns from the first and the second crawls.

  • _x is your crawl data from “Text Only”
  • _y is your crawl data from “JavaScript”

Combined Crawl Data Frame
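If you find “_x” and “_y” hard to read, pandas also lets you pick your own suffixes at merge time. This optional variant (my addition) produces the same result with self-describing column names, but remember to adapt the column names in the next step if you use it:

##Optional: use self-describing suffixes instead of the default _x/_y
df = pd.merge(dfTextonly, dfJS, on='Address', how='outer', suffixes=('_textonly', '_js'))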

Note: in the Spyder IDE, you can inspect the combined dataframe in the Variable Explorer.

Step #5: Check Differences Between Crawls

Here, we will compute the differences in word count and number of links between our “Text Only” crawl and our “JavaScript” rendered crawl.

We want to flag pages with big differences between the two crawls.

Why?

Because it means that a lot of content is hidden behind JavaScript and can’t be accessed during Google’s first wave of crawling.

##Check the differences in each crawl

df['Diff Wordcount'] = df['Word Count_y'] - df['Word Count_x']
df['Diff Outlinks'] = df['Outlinks_y'] - df['Outlinks_x']
df['Diff Unique Outlinks'] = df['Unique Outlinks_y'] - df['Unique Outlinks_x']
df['Diff Inlinks'] = df['Inlinks_y'] - df['Inlinks_x']
df['Diff Unique Inlinks'] = df['Unique Inlinks_y'] - df['Unique Inlinks_x']

##Check if canonical links are equivalent

##Need the NumPy library

import numpy as np

df["Canonicals are equal"] = np.where((df["Canonical Link Element 1_y"] == df["Canonical Link Element 1_x"]), "yes", "no")

Source: Tobias Willmann

Here, you should get a result that looks like this.
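To actually flag the problematic pages, you can filter on the new difference columns. Here is a minimal sketch; the thresholds are assumptions that you should tune to your own site:

##Flag pages where JS rendering adds a lot of content or links,
##or where the canonical changes (the thresholds are arbitrary examples)
WORD_THRESHOLD = 200
LINK_THRESHOLD = 10
flagged = df[(df['Diff Wordcount'] > WORD_THRESHOLD) | (df['Diff Outlinks'] > LINK_THRESHOLD) | (df['Canonicals are equal'] == 'no')]
print(flagged[['Address', 'Diff Wordcount', 'Diff Outlinks', 'Canonicals are equal']])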

Step #6: Make a Report With Excel

To export your data to Excel, just use the pandas to_excel function.

##Export to Excel
df.to_excel("rendering-test.xlsx")
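Note that to_excel needs an Excel writer engine such as openpyxl installed (pip install openpyxl if it is missing). Optionally, you can sort by the word-count gap before exporting so that the most suspicious pages end up at the top of the report:

##Optional: sort by the biggest word-count gaps before exporting
df.sort_values('Diff Wordcount', ascending=False).to_excel("rendering-test.xlsx", index=False)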

Full Python Code

##Print the path of your current working directory
import os
print(os.getcwd())
##This path is where you should save your CSV crawl exports

##Import the pandas and NumPy libraries
import pandas as pd
import numpy as np

##Load the crawls into pandas (header=1 skips Screaming Frog's title row)
dfTextonly = pd.read_csv('Text-only-5000-crawl.csv', low_memory=False, header=1)
dfTextonly = dfTextonly[['Address', 'Status Code', 'Word Count', 'Outlinks', 'Unique Outlinks', 'Inlinks', 'Unique Inlinks', 'Canonical Link Element 1']].copy()
dfJS = pd.read_csv('JS-Rendered-5000-crawl.csv', low_memory=False, header=1)
dfJS = dfJS[['Address', 'Status Code', 'Word Count', 'Outlinks', 'Unique Outlinks', 'Inlinks', 'Unique Inlinks', 'Canonical Link Element 1']].copy()

##Combine the two crawls into one dataframe
df = pd.merge(dfTextonly, dfJS, on='Address', how='outer')

##Check the differences
df['Diff Wordcount'] = df['Word Count_y'] - df['Word Count_x']
df['Diff Outlinks'] = df['Outlinks_y'] - df['Outlinks_x']
df['Diff Unique Outlinks'] = df['Unique Outlinks_y'] - df['Unique Outlinks_x']
df['Diff Inlinks'] = df['Inlinks_y'] - df['Inlinks_x']
df['Diff Unique Inlinks'] = df['Unique Inlinks_y'] - df['Unique Inlinks_x']
df["Canonicals are equal"] = np.where((df["Canonical Link Element 1_y"] == df["Canonical Link Element 1_x"]), "yes", "no")

##Export to Excel
df.to_excel("rendering-test.xlsx")

That’s it: you can now compare, on a large scale, which pages don’t load properly without JavaScript.
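As a quick final check before digging into the spreadsheet, you can also summarise the results directly in Python (a small optional addition):

##How many pages changed canonicals, and how big are the word-count gaps?
print(df['Canonicals are equal'].value_counts())
print(df['Diff Wordcount'].describe())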