Natzir Turrado has built a fantastic keyword cannibalization tool for Python to identify possible keyword cannibalization issues using Google Search Console data.
His work has been the key to unlocking some of the biggest challenges that I have faced with Python so far.
He kindly allowed me to write this blog post showcasing his work as part of my guide to Python for SEO.
Here is how you can identify cannibalization issues in Python, using Jupyter Notebook.
Import Libraries
For this tutorial, we’ll use several libraries, including Pandas, datetime, requests, and BeautifulSoup.
import pandas as pd
import re
import datetime
import qgrid
from collections import defaultdict
from dateutil import relativedelta
import httplib2
from apiclient import errors
from apiclient.discovery import build
from oauth2client.client import OAuth2WebServerFlow
import requests
from bs4 import BeautifulSoup
Add Your Google Search Console API Keys
It is now time to add your own Google CLIENT_ID, CLIENT_SECRET and your Search Console site property.
If you don’t know how, read my blog post on how to get your Google Search Console API keys.
CLIENT_ID = ''
CLIENT_SECRET = ''
site = ''
Make Your Search Console API Call
To make the API call to the Google Search Console API, you will need to set up the OAuth scope, authorize your credentials with httplib2, and build your service.
If you don’t know how, you can read my guide that shows you how to make an API call with Search Console.
Add the start date and end date.
end_date = datetime.date.today()
#end_date = datetime.date.today()-relativedelta.relativedelta(days=3) #My Note
start_date = end_date - relativedelta.relativedelta(months=3)
A quick comment on Natzir’s work: you should use
datetime.date.today() - relativedelta.relativedelta(days=3)
as the end date, since Google Search Console data only goes up to about three days ago.
Authorize Your Connection
OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
# Run through the OAuth flow and retrieve credentials
flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, redirect_uri=REDIRECT_URI)
authorize_url = flow.step1_get_authorize_url()
print('Go to the following link in your browser: ' + authorize_url)
code = input('Enter verification code: ').strip()
credentials = flow.step2_exchange(code)
# Create an httplib2.Http object and authorize it with our credentials
http = httplib2.Http()
http = credentials.authorize(http)
webmasters_service = build('webmasters', 'v3', http=http)
Execute the API Request
Here we will call the Google Search Console API and request queries per page, with a limit of 25,000 rows.
Note that you can extract more than 25,000 rows from the Google Search Console API by paginating over the results.
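To go past the 25,000-row cap, you can page through the results with the API’s startRow parameter. This helper is a sketch of mine, not part of Natzir’s original script:

```python
def fetch_all_rows(service, property_uri, request, page_size=25000):
    """Page through the Search Console API using startRow until no rows remain."""
    all_rows = []
    start_row = 0
    while True:
        # Copy the base request, overriding the paging parameters
        paged = dict(request, rowLimit=page_size, startRow=start_row)
        response = service.searchanalytics().query(
            siteUrl=property_uri, body=paged).execute()
        rows = response.get('rows', [])
        all_rows.extend(rows)
        if len(rows) < page_size:  # last (possibly partial) page reached
            break
        start_row += page_size
    return all_rows
```

Call it as `fetch_all_rows(webmasters_service, site, request)` in place of a single `execute_request` call when your property has more rows than one response can hold.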
def execute_request(service, property_uri, request):
    return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()

request = {
    'startDate': datetime.datetime.strftime(start_date, '%Y-%m-%d'),
    'endDate': datetime.datetime.strftime(end_date, '%Y-%m-%d'),
    'dimensions': ['page', 'query'],
    'rowLimit': 25000  # up to 25,000 rows per request
}
# Add a device filter to the request
device_category = input('Enter device category: MOBILE, DESKTOP or TABLET (leave it blank for all devices): ').strip()
if device_category:
    request['dimensionFilterGroups'] = [{'filters': [{'dimension': 'device', 'expression': device_category}]}]

# Send the request to the Search Console API
response = execute_request(webmasters_service, site, request)
Parse the JSON Returned
Parsing the JSON response lets us extract each row of data and append it to a dictionary.
scDict = defaultdict(list)

for row in response['rows']:
    scDict['page'].append(row['keys'][0] or 0)
    scDict['query'].append(row['keys'][1] or 0)
    scDict['clicks'].append(row['clicks'] or 0)
    scDict['ctr'].append(row['ctr'] or 0)
    scDict['impressions'].append(row['impressions'] or 0)
    scDict['position'].append(row['position'] or 0)
Create a DataFrame With the Google Search Console Data
Now, we will convert the dictionary into a Pandas DataFrame so we can manipulate the data however we want.
df = pd.DataFrame(data=scDict)
df['clicks'] = df['clicks'].astype('int')
df['ctr'] = df['ctr'] * 100
df['impressions'] = df['impressions'].astype('int')
df['position'] = df['position'].round(2)
df.sort_values('clicks', inplace=True, ascending=False)
df
Clean the DataFrame and Sort by Query
Let’s clean the resulting DataFrame.
We will:
- Exclude results on page one (see the note below)
- Exclude branded queries
- Exclude queries that match only a single URL

Note by Natzir Turrado: according to MozCast, the average first page has only 8 listings, so we only look for URLs beyond this position (this behaviour can be changed via ‘SERP_results’).
SERP_results = 8 # insert here your preferred number of page-one results
branded_queries = 'brand|vrand|b rand...' # insert here your branded queries
df_canibalized = df[df['position'] > SERP_results]
df_canibalized = df_canibalized[~df_canibalized['query'].str.contains(branded_queries, regex=True)]
df_canibalized = df_canibalized[df_canibalized.duplicated(subset=['query'], keep=False)]
df_canibalized.set_index(['query'],inplace=True)
df_canibalized.sort_index(inplace=True)
df_canibalized.reset_index(inplace=True)
df_canibalized
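At this point it can also help to see which queries have the most URLs competing for them. This groupby summary is my own addition, shown here with a toy stand-in for the df_canibalized built above:

```python
import pandas as pd

# Toy stand-in for the df_canibalized DataFrame built above
df_canibalized = pd.DataFrame({
    'query': ['python seo', 'python seo', 'python seo', 'log analysis', 'log analysis'],
    'page': ['/a', '/b', '/c', '/d', '/e'],
    'clicks': [10, 5, 2, 8, 3],
})

# Count how many distinct URLs compete for each query, worst offenders first
competing = (df_canibalized.groupby('query')['page']
             .nunique()
             .sort_values(ascending=False)
             .rename('competing_urls'))
print(competing)
```

Queries at the top of this series are the ones most worth consolidating first.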
Scrape URLs and Add Titles and Meta Descriptions
Fetching each page’s title and meta description with requests lets us spot duplicated metadata across the competing URLs.
def get_meta(url):
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.title.get_text() if soup.title else ''
    descriptions = soup.select('meta[name="description"]')
    meta = descriptions[0].attrs['content'] if descriptions else ''
    return title, meta

df_canibalized['title'], df_canibalized['meta'] = zip(*df_canibalized['page'].apply(get_meta))
df_canibalized
Visualize Data in Jupyter Notebook
This is the moment of glory.
Use qgrid’s show_grid (it only works in Jupyter Notebook) to get a dynamic grid for analyzing the data.
grid = qgrid.show_grid(df_canibalized, show_toolbar=True)
grid
Automate the Process
I have not managed to fully automate this process yet. One thing is certain: you will need to authenticate with a JSON credentials file, rather than the interactive OAuth flow above, to achieve this.
Other Technical SEO Guides With Python
- Find Rendering Problems On Large Scale Using Python + Screaming Frog
- Recrawl URLs Extracted with Screaming Frog (using Python)
- Find Keyword Cannibalization Using Google Search Console and Python
- Get BERT Score for SEO
- Web Scraping With Python and Requests-HTML
- Randomize User-Agent With Python and BeautifulSoup
- Create a Simple XML Sitemap With Python
- Web Scraping with Scrapy and Python
This concludes the amazing work done by Natzir. If you ever manage to automate the process, contact me and I’ll be happy to publish an update.
SEO Strategist at Tripadvisor, ex-Seek (Melbourne, Australia). Specialized in technical SEO. Writes about Python, information retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.