Find Keyword Cannibalization Using Google Search Console and Python

This post is part of the guide on Python for SEO and the guide on the Google Search Console API.

Natzir Turrado has built a fantastic Python tool to identify possible keyword cannibalization issues using Google Search Console data.

His work has been the key to unlocking some of the biggest challenges that I have faced with Python so far.

He kindly agreed to let me write this blog post showcasing his work as part of my guide to Python for SEO.


    Keyword Cannibalization Tool for Python

    Here is how you can identify cannibalization issues in Python, using Jupyter Notebook.

    Import Libraries

    For this tutorial, we’ll use several libraries: Pandas, datetime, dateutil, requests, BeautifulSoup, qgrid, and the Google API client libraries (httplib2, apiclient and oauth2client).

    import pandas as pd
    import re
    import datetime
    import qgrid
    from collections import defaultdict
    from dateutil import relativedelta
    import httplib2
    from apiclient import errors
    from apiclient.discovery import build
    from oauth2client.client import OAuth2WebServerFlow
    import requests
    from bs4 import BeautifulSoup
    

    Add Your Google Search Console API Keys

    It is now time to add your own Google CLIENT_ID, CLIENT_SECRET & your Search Console SITE PROPERTY.

    If you don’t know how, read my blog post on how to get your Google Search Console API keys.

    CLIENT_ID = ''
    CLIENT_SECRET = ''
    site = ''
    

    Make Your Search Console API Call

    To make the API call to the Google Search Console API, you will need to set up the OAuth scope, authorize your credentials with httplib2, and build your service.

    If you don’t know how, you can read the guide that shows you how to make an API call with the Search Console API.

    Add the start date and end date.

    end_date = datetime.date.today()
    #end_date = datetime.date.today()-relativedelta.relativedelta(days=3) #My Note
    start_date = end_date - relativedelta.relativedelta(months=3)
    

    A quick comment on Natzir’s work: you should use datetime.date.today() - relativedelta.relativedelta(days=3) as the end date, since Google Search Console only provides data up to about three days ago.
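
    Applying that note (the same adjustment as the commented-out line above), the date setup becomes:

    # Pull data up to three days ago, since fresher dates are not yet available in Search Console.
    end_date = datetime.date.today() - relativedelta.relativedelta(days=3)
    start_date = end_date - relativedelta.relativedelta(months=3)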

    Authorize Your Connection

    OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
    REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
    
    # Run through the OAuth flow and retrieve credentials
    flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, redirect_uri=REDIRECT_URI)
    authorize_url = flow.step1_get_authorize_url()
    print ('Go to the following link in your browser: ' + authorize_url)
    code = input('Enter verification code: ').strip()
    credentials = flow.step2_exchange(code)
    
    # Create an httplib2.Http object and authorize it with our credentials
    http = httplib2.Http()
    http = credentials.authorize(http)
    
    webmasters_service = build('webmasters', 'v3', http=http)
    

    Execute the API Request

    Here, we will call the Google Search Console API and request queries per page, with a limit of 25,000 rows.

    Just know that you can get more than 25,000 rows from the Google Search Console API by paginating with the startRow parameter (see the sketch after the code block below).

    def execute_request(service, property_uri, request):
        return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()
    
    request = {
        'startDate': datetime.datetime.strftime(start_date,"%Y-%m-%d"),
        'endDate': datetime.datetime.strftime(end_date,'%Y-%m-%d'),
        'dimensions': ['page','query'],
    'rowLimit': 25000 #up to 25,000 rows per request
    }
    
    #Adding a device filter to request
    device_category = input('Enter device category: MOBILE, DESKTOP or TABLET (leave it blank for all devices): ').strip()
    if device_category:
        request['dimensionFilterGroups'] = [{'filters':[{'dimension':'device','expression':device_category}]}]
    
    #Request to SC API
    response = execute_request(webmasters_service, site, request)
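
    As mentioned above, you can fetch more than 25,000 rows by paginating with the API’s startRow parameter. Here is a minimal sketch; the fetch_all_rows helper is my own addition, not part of Natzir’s notebook:

    # Minimal pagination sketch (my addition, not part of the original notebook).
    # The Search Analytics API accepts a startRow offset, so we keep requesting
    # 25,000-row chunks until a response comes back with fewer rows than requested.
    def fetch_all_rows(service, property_uri, base_request, chunk_size=25000):
        all_rows = []
        start_row = 0
        while True:
            paged_request = dict(base_request, rowLimit=chunk_size, startRow=start_row)
            paged_response = execute_request(service, property_uri, paged_request)
            rows = paged_response.get('rows', [])
            all_rows.extend(rows)
            if len(rows) < chunk_size:
                break
            start_row += chunk_size
        return all_rows
    
    # Example usage: all_rows = fetch_all_rows(webmasters_service, site, request)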
    

    Parse the JSON Returned

    Next, we parse the JSON response and append each row’s values to a dictionary, which we will later convert into a DataFrame.

    scDict = defaultdict(list)
    
    for row in response['rows']:
        scDict['page'].append(row['keys'][0] or 0)
        scDict['query'].append(row['keys'][1] or 0)
        scDict['clicks'].append(row['clicks'] or 0)
        scDict['ctr'].append(row['ctr'] or 0)
        scDict['impressions'].append(row['impressions'] or 0)
        scDict['position'].append(row['position'] or 0)
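
    Note that if the property returns no data for the selected period, the response will have no 'rows' key and the loop above will raise a KeyError. A slightly more defensive variant (my adjustment, not part of the original notebook) iterates over response.get('rows', []):

    # Defensive variant (my adjustment, not in the original notebook):
    # .get() returns an empty list when the API response contains no rows.
    for row in response.get('rows', []):
        scDict['page'].append(row['keys'][0])
        scDict['query'].append(row['keys'][1])
        scDict['clicks'].append(row['clicks'] or 0)
        scDict['ctr'].append(row['ctr'] or 0)
        scDict['impressions'].append(row['impressions'] or 0)
        scDict['position'].append(row['position'] or 0)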
    

    Create a DataFrame With the Google Search Console Data

    Now, we will convert the dictionary into a Pandas DataFrame so you can do whatever you want with the data.

    df = pd.DataFrame(data = scDict)
    
    df['clicks'] = df['clicks'].astype('int')
    df['ctr'] = df['ctr']*100
    df['impressions'] = df['impressions'].astype('int')
    df['position'] = df['position'].round(2)
    df.sort_values('clicks',inplace=True,ascending=False)
    df
    

    Clean the DataFrame and Sort by Query

    Let’s clean the resulting DataFrame.

    We will:

    • Exclude page-one results (see the note below)
    • Exclude branded queries
    • Exclude queries with only one ranking URL (no cannibalization possible)

    Note by Natzir Turrado: according to MozCast, the average page one has only 8 listings, so we only look for URLs ranking beyond this position (this behaviour can be changed with ‘SERP_results’).

    SERP_results = 8 #insert here your preferred value for SERP results
    branded_queries = 'brand|vrand|b rand...' #insert here your branded queries
    
    df_canibalized = df[df['position'] > SERP_results] 
    df_canibalized = df_canibalized[~df_canibalized['query'].str.contains(branded_queries, regex=True)]
    df_canibalized = df_canibalized[df_canibalized.duplicated(subset=['query'], keep=False)]
    df_canibalized.set_index(['query'],inplace=True)
    df_canibalized.sort_index(inplace=True)
    df_canibalized.reset_index(inplace=True)
    df_canibalized
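
    Before scraping anything, it can also help to see how many URLs compete for each query. This groupby is my own addition, not part of the original notebook:

    # Count how many distinct URLs compete for each query (my addition).
    competing_pages = (
        df_canibalized.groupby('query')['page']
        .nunique()
        .sort_values(ascending=False)
    )
    competing_pages.head(20)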
    

    Scrape URLs and Add Titles and Meta Descriptions

    Fetching each URL’s title and meta description with requests and BeautifulSoup lets us spot duplicate or near-duplicate metadata across the competing pages.

    def get_meta(url):
        page = requests.get(url)
        soup = BeautifulSoup(page.content,'html.parser')
        title = soup.find('title').get_text()
        meta = soup.select('meta[name="description"]')[0].attrs["content"] 
        return title, meta
    
    df_canibalized['title'],df_canibalized['meta'] = zip(*df_canibalized['page'].apply(get_meta))
    df_canibalized
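
    Be aware that the function above will raise an exception if a page times out, returns an error, or has no title or meta description tag. A more defensive variant, which is my own adjustment rather than Natzir’s original code, could look like this:

    # Defensive variant of get_meta (my adjustment, not the original code):
    # handles request errors and pages missing a <title> or meta description.
    def get_meta_safe(url):
        try:
            page = requests.get(url, timeout=10)
            soup = BeautifulSoup(page.content, 'html.parser')
            title_tag = soup.find('title')
            title = title_tag.get_text() if title_tag else ''
            meta_tag = soup.select_one('meta[name="description"]')
            meta = meta_tag.attrs.get('content', '') if meta_tag else ''
            return title, meta
        except requests.RequestException:
            return '', ''

    You would then apply it the same way as the original: df_canibalized['title'], df_canibalized['meta'] = zip(*df_canibalized['page'].apply(get_meta_safe)).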
    

    Visualize Data in Jupyter Notebook

    This is now the moment of glory.

    Use qgrid’s show_grid (it only works in Jupyter) to get a dynamic grid for analyzing the data.

    grid = qgrid.show_grid(df_canibalized, show_toolbar=True)
    grid
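
    If you prefer to analyze the results outside Jupyter, you can simply export the DataFrame; the file name below is just an example:

    # Optional export for analysis outside Jupyter (file name is an example).
    df_canibalized.to_csv('keyword_cannibalization.csv', index=False)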
    

    Automate the Process

    I am still working on turning this into a fully automated process. One thing is sure: you will need to authenticate with a JSON credentials file to achieve this.
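
    One possible approach, assuming you swap the interactive OAuth flow for a service account (and add the service account’s email address as a user of the Search Console property), could look roughly like this; the key file path is hypothetical:

    # Possible automation sketch (an assumption, not the author's confirmed method):
    # authenticate with a service-account JSON key via google-auth, removing the
    # interactive "Enter verification code" step.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    
    SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
    KEY_FILE = 'service-account.json'  # hypothetical path to your JSON credentials file
    
    credentials = service_account.Credentials.from_service_account_file(
        KEY_FILE, scopes=SCOPES)
    webmasters_service = build('webmasters', 'v3', credentials=credentials)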


    Other Technical SEO Guides With Python

    This is the conclusion of the amazing work done by Natzir. If you ever manage to automate the process, please contact me and I’ll be happy to publish an update.
