Find Keyword Cannibalization Using Google Search Console and Python

This post is part of the guide on Python for SEO and the guide on the Google Search Console API.

Natzir Turrado has built a fantastic Python tool to identify possible keyword cannibalization issues using Google Search Console data.

His work has been the key to unlocking some of the biggest challenges that I have faced with Python so far.

He kindly agreed to let me write this blog post showcasing his work as part of my guide to Python for SEO.


    Keyword Cannibalization Tool for Python

    Here is how you can identify cannibalization issues in Python, using Jupyter Notebook.

    Import Libraries

    For this tutorial, we’ll use several libraries: Pandas, datetime, dateutil, requests, BeautifulSoup, qgrid, and the Google API client libraries (httplib2, apiclient and oauth2client).

    import pandas as pd
    import re
    import datetime
    import qgrid
    from collections import defaultdict
    from dateutil import relativedelta
    import httplib2
    from apiclient import errors
    from apiclient.discovery import build
    from oauth2client.client import OAuth2WebServerFlow
    import requests
    from bs4 import BeautifulSoup
    

    Add Your Google Search Console API Keys

    It is now time to add your own Google CLIENT_ID, CLIENT_SECRET & your Search Console SITE PROPERTY.

    If you don’t know how, read my blog post on how to get your Google Search Console API keys.

    CLIENT_ID = ''
    CLIENT_SECRET = ''
    site = ''
    

    Make Your Search Console API Call

    To make the API call to the Google Search Console API, you will need to set up the OAuth scope, authorize your credentials with httplib2, and build your service.

    If you don’t know how, you can read the guide that shows you how to make an API call with the Search Console API.

    Add the start date and end date.

    end_date = datetime.date.today()
    #end_date = datetime.date.today()-relativedelta.relativedelta(days=3) #My Note
    start_date = end_date - relativedelta.relativedelta(months=3)
    

    A quick comment on Natzir’s work: you should use datetime.date.today() - relativedelta.relativedelta(days=3) as the end date, since Google Search Console only provides data up to about three days ago.
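
    Applying that note (the same adjustment as the commented-out line above), the date setup becomes:

    # Pull data up to three days ago, since fresher dates are not yet available in Search Console.
    end_date = datetime.date.today() - relativedelta.relativedelta(days=3)
    start_date = end_date - relativedelta.relativedelta(months=3)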

    Authorize Your Connection

    OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
    REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
    
    # Run through the OAuth flow and retrieve credentials
    flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, redirect_uri=REDIRECT_URI)
    authorize_url = flow.step1_get_authorize_url()
    print ('Go to the following link in your browser: ' + authorize_url)
    code = input('Enter verification code: ').strip()
    credentials = flow.step2_exchange(code)
    
    # Create an httplib2.Http object and authorize it with our credentials
    http = httplib2.Http()
    http = credentials.authorize(http)
    
    webmasters_service = build('webmasters', 'v3', http=http)
    

    Execute the API Request

    Here, we will call the Google Search Console API and request queries per page, with a limit of 25,000 rows.

    Just know that you can get more than 25,000 rows from the Google Search Console API by paginating with the startRow parameter (see the sketch after the code block below).

    def execute_request(service, property_uri, request):
        return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()
    
    request = {
        'startDate': datetime.datetime.strftime(start_date,"%Y-%m-%d"),
        'endDate': datetime.datetime.strftime(end_date,'%Y-%m-%d'),
        'dimensions': ['page','query'],
    'rowLimit': 25000 #up to 25,000 rows per request
    }
    
    #Adding a device filter to request
    device_category = input('Enter device category: MOBILE, DESKTOP or TABLET (leave it blank for all devices): ').strip()
    if device_category:
        request['dimensionFilterGroups'] = [{'filters':[{'dimension':'device','expression':device_category}]}]
    
    #Request to SC API
    response = execute_request(webmasters_service, site, request)
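
    As mentioned above, you can fetch more than 25,000 rows by paginating with the API’s startRow parameter. Here is a minimal sketch; the fetch_all_rows helper is my own addition, not part of Natzir’s notebook:

    # Minimal pagination sketch (my addition, not part of the original notebook).
    # The Search Analytics API accepts a startRow offset, so we keep requesting
    # 25,000-row chunks until a response comes back with fewer rows than requested.
    def fetch_all_rows(service, property_uri, base_request, chunk_size=25000):
        all_rows = []
        start_row = 0
        while True:
            paged_request = dict(base_request, rowLimit=chunk_size, startRow=start_row)
            paged_response = execute_request(service, property_uri, paged_request)
            rows = paged_response.get('rows', [])
            all_rows.extend(rows)
            if len(rows) < chunk_size:
                break
            start_row += chunk_size
        return all_rows
    
    # Example usage: all_rows = fetch_all_rows(webmasters_service, site, request)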
    

    Parse the JSON Returned

    Next, we parse the JSON response and append each row’s values to a dictionary, which we will later convert into a DataFrame.

    scDict = defaultdict(list)
    
    for row in response['rows']:
        scDict['page'].append(row['keys'][0] or 0)
        scDict['query'].append(row['keys'][1] or 0)
        scDict['clicks'].append(row['clicks'] or 0)
        scDict['ctr'].append(row['ctr'] or 0)
        scDict['impressions'].append(row['impressions'] or 0)
        scDict['position'].append(row['position'] or 0)
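
    Note that if the property returns no data for the selected period, the response will have no 'rows' key and the loop above will raise a KeyError. A slightly more defensive variant (my adjustment, not part of the original notebook) iterates over response.get('rows', []):

    # Defensive variant (my adjustment, not in the original notebook):
    # .get() returns an empty list when the API response contains no rows.
    for row in response.get('rows', []):
        scDict['page'].append(row['keys'][0])
        scDict['query'].append(row['keys'][1])
        scDict['clicks'].append(row['clicks'] or 0)
        scDict['ctr'].append(row['ctr'] or 0)
        scDict['impressions'].append(row['impressions'] or 0)
        scDict['position'].append(row['position'] or 0)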
    

    Create a DataFrame With the Google Search Console Data

    Now, we will convert the dictionary into a Pandas DataFrame so you can do whatever you want with the data.

    df = pd.DataFrame(data = scDict)
    
    df['clicks'] = df['clicks'].astype('int')
    df['ctr'] = df['ctr']*100
    df['impressions'] = df['impressions'].astype('int')
    df['position'] = df['position'].round(2)
    df.sort_values('clicks',inplace=True,ascending=False)
    df
    

    Clean the DataFrame and Sort by Query

    Let’s clean the resulting DataFrame.

    We will:

    • Exclude page-one results (see the note below)
    • Exclude branded queries
    • Exclude queries with only one ranking URL (no cannibalization possible)

    Note by Natzir Turrado: according to MozCast, the average page one has only 8 listings, so we only look for URLs ranking beyond this position (this behaviour can be changed with ‘SERP_results’).

    SERP_results = 8 #insert here your preferred value for SERP results
    branded_queries = 'brand|vrand|b rand...' #insert here your branded queries
    
    df_canibalized = df[df['position'] > SERP_results] 
    df_canibalized = df_canibalized[~df_canibalized['query'].str.contains(branded_queries, regex=True)]
    df_canibalized = df_canibalized[df_canibalized.duplicated(subset=['query'], keep=False)]
    df_canibalized.set_index(['query'],inplace=True)
    df_canibalized.sort_index(inplace=True)
    df_canibalized.reset_index(inplace=True)
    df_canibalized
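
    Before scraping anything, it can also help to see how many URLs compete for each query. This groupby is my own addition, not part of the original notebook:

    # Count how many distinct URLs compete for each query (my addition).
    competing_pages = (
        df_canibalized.groupby('query')['page']
        .nunique()
        .sort_values(ascending=False)
    )
    competing_pages.head(20)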
    

    Scrape URLs and Add Titles and Meta Descriptions

    Fetching each URL’s title and meta description with requests and BeautifulSoup lets us spot duplicate or near-duplicate metadata across the competing pages.

    def get_meta(url):
        page = requests.get(url)
        soup = BeautifulSoup(page.content,'html.parser')
        title = soup.find('title').get_text()
        meta = soup.select('meta[name="description"]')[0].attrs["content"] 
        return title, meta
    
    df_canibalized['title'],df_canibalized['meta'] = zip(*df_canibalized['page'].apply(get_meta))
    df_canibalized
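
    Be aware that the function above will raise an exception if a page times out, returns an error, or has no title or meta description tag. A more defensive variant, which is my own adjustment rather than Natzir’s original code, could look like this:

    # Defensive variant of get_meta (my adjustment, not the original code):
    # handles request errors and pages missing a <title> or meta description.
    def get_meta_safe(url):
        try:
            page = requests.get(url, timeout=10)
            soup = BeautifulSoup(page.content, 'html.parser')
            title_tag = soup.find('title')
            title = title_tag.get_text() if title_tag else ''
            meta_tag = soup.select_one('meta[name="description"]')
            meta = meta_tag.attrs.get('content', '') if meta_tag else ''
            return title, meta
        except requests.RequestException:
            return '', ''

    You would then apply it the same way as the original: df_canibalized['title'], df_canibalized['meta'] = zip(*df_canibalized['page'].apply(get_meta_safe)).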
    

    Visualize Data in Jupyter Notebook

    This is now the moment of glory.

    Use qgrid’s show_grid (it only works in Jupyter) to get a dynamic grid for analyzing the data.

    grid = qgrid.show_grid(df_canibalized, show_toolbar=True)
    grid
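
    If you prefer to analyze the results outside Jupyter, you can simply export the DataFrame; the file name below is just an example:

    # Optional export for analysis outside Jupyter (file name is an example).
    df_canibalized.to_csv('keyword_cannibalization.csv', index=False)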
    

    Automate the Process

    I am still working on turning this into a fully automated process. One thing is sure: you will need to authenticate with a JSON credentials file to achieve this.
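
    One possible approach, assuming you swap the interactive OAuth flow for a service account (and add the service account’s email address as a user of the Search Console property), could look roughly like this; the key file path is hypothetical:

    # Possible automation sketch (an assumption, not the author's confirmed method):
    # authenticate with a service-account JSON key via google-auth, removing the
    # interactive "Enter verification code" step.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    
    SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
    KEY_FILE = 'service-account.json'  # hypothetical path to your JSON credentials file
    
    credentials = service_account.Credentials.from_service_account_file(
        KEY_FILE, scopes=SCOPES)
    webmasters_service = build('webmasters', 'v3', credentials=credentials)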


    Other Technical SEO Guides With Python

    This is the conclusion of the amazing work done by Natzir. If you ever manage to automate the process, please contact me and I’ll be happy to publish an update.
