If you are wondering how to extract all your keywords from Google Search Console, you're in the right place.
This post will show you how to automatically extract all your search traffic and all your keywords from Google Search Console (formerly Google Webmaster Tools) using Python and the Google Search Console API.
(Step-by-step guide with code sample)
The perfect solution for SEOs working on large websites with a lot of pages and a lot of keywords.
That's right, you will learn how to extract more than 25,000 keywords and automatically export the data to a CSV file.
With this script, I managed to extract over 500K rows of data each month from the GSC API.
If you want a more complex, but more powerful, solution to extract all your Google Search Console data, you can check this introduction to the GSC API video tutorial.
To learn more about Python, view my complete guide on Python for SEO.
This post is part of the guide on the Google Search Console API.
Limits of Google Search Console
GSC is a free tool offered by Google, so it is understandable that it doesn't expose all of the data in the interface, as that would be very expensive. Instead, Google offers an API.
With GSC, you can't get a report of top-performing keywords by page, by country, or by device.
The Google Search Console performance report also limits you to 1,000 rows of data, which can be annoying.
The GSC API lets you build keywords-by-page reports and query all the pages and all the keywords that you can possibly extract from GSC.
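To give you an idea of what this looks like, here is the type of request body that the Search Analytics endpoint accepts; it is the same structure used in the script later in this post (the dates below are placeholders):
# Example request body: a keywords-by-page report for a single day,
# paginated 25,000 rows at a time.
request = {
    'startDate': '2023-01-01',   # placeholder dates
    'endDate': '2023-01-01',
    'dimensions': ['date', 'page', 'query'],
    'rowLimit': 25000,
    'startRow': 0
}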
Extract All Your Keywords Data From Google Search Console
For those not used to working with APIs, getting started can be a hurdle, but it is worth it, because Google APIs are awesome for SEOs.
Fear not, I will walk you through it to help you get started.
You can learn everything from this post, but if you feel that you need an easier starter guide, just read my post “How to Use Google Search Console API With Python“.
Here are the steps to extract all your search traffic data from Google Search Console.
- Get Your Google Search Console API Key
- Import Libraries
- Create a Directory for Your Project
- Log in to the API
- Execute the Request
- Create a Function to Process the CSV Files
- Create a Function to Extract All the Data
- Run the Extraction
Step 1: Get Your Google Search Console API Key
First, you need to get an API key to connect to the Google Search Console API.
What is an API Key?
The API key acts like a username and password that you need to provide to access the API.
How do you get the GSC API Keys?
Follow this guide for the detailed steps to get your Google Search Console API key.
Otherwise, here are the simplified steps to get your Google Search Console API keys.
- Go to Google’s developers console and sign in;
- Go to “Dashboard” and click “Enable APIs and Services”;
- Search for “Google Search Console API” and enable the API;
- Go to the “Credentials” tab, click “Create credentials” and select “OAuth client ID”;
- Click “Configure consent screen” and give a name to your product;
- Choose “Other” as the application type and click “Create”;
- Copy the client ID and client secret, or go on and save to download the JSON file (an optional sanity check for this file is shown right after this list).
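If you want to make sure the downloaded file is valid before moving on, here is a minimal, optional check. It assumes the file was saved as client_secrets.json and that the OAuth client was created as an “Other”/desktop application, which usually stores its values under an "installed" key; verify against your own file.
import json

# Optional sanity check: confirm client_secrets.json is readable and contains a client ID.
# The top-level key is usually "installed" for an "Other"/desktop OAuth client.
with open('client_secrets.json') as f:
    secrets = json.load(f)
print(secrets.get('installed', {}).get('client_id', 'client_id not found'))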
Step 2: Import Libraries
Here are the libraries that we will need to run this code:
- pandas, datetime, collections, httplib2, googleapiclient, oauth2client, argparse, re, os, urllib, dateutil
The third-party packages (pandas, httplib2, google-api-python-client, oauth2client, python-dateutil) can be installed with pip; the others are part of the Python standard library.
import pandas as pd
import datetime
from datetime import date, timedelta
import httplib2
from googleapiclient.discovery import build
from oauth2client.client import OAuth2WebServerFlow
from collections import defaultdict
from dateutil import relativedelta
import argparse
from oauth2client import client
from oauth2client import file
from oauth2client import tools
import re
import os
from urllib.parse import urlparse
Step 3: Create Directory for Your Project
The goal here is to use the URL of your property to create a directory where the CSV data will be stored. These two functions extract the domain name and create the project folder using that name.
# Get Domain Name to Create a Project
def get_domain_name(start_url):
    domain_name = '{uri.netloc}'.format(uri=urlparse(start_url)) # Get Domain Name To Name Project
    domain_name = domain_name.replace('.', '_')
    return domain_name

# Create a project Directory for this website
def create_project(directory):
    if not os.path.exists(directory):
        print('Create project: ' + directory)
        os.makedirs(directory)
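As a quick illustration, here is what these two functions do with a placeholder URL: the dots in the domain are replaced by underscores so the result can be used as a folder name.
# Hypothetical example using a placeholder URL.
print(get_domain_name('https://www.example.com'))           # -> 'www_example_com'
create_project(get_domain_name('https://www.example.com'))  # creates the www_example_com folder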
Step 4: Log in to the API
Now, to log in to the API, you will need to download your JSON API key from step 1 and save it as client_secrets.json in your working folder.
def authorize_creds(creds):
    # Variable parameter that controls the set of resources that the access token permits.
    SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
    # Path to client_secrets.json file
    CLIENT_SECRETS_PATH = creds
    # Create a parser to be able to open browser for Authorization
    parser = argparse.ArgumentParser(
        formatter_class=argparse.RawDescriptionHelpFormatter,
        parents=[tools.argparser])
    flags = parser.parse_args([])
    flow = client.flow_from_clientsecrets(
        CLIENT_SECRETS_PATH, scope=SCOPES,
        message=tools.message_if_missing(CLIENT_SECRETS_PATH))
    # Prepare credentials and authorize HTTP
    # If they exist, get them from the storage object;
    # credentials will get written back to a file.
    storage = file.Storage('authorizedcreds.dat')
    credentials = storage.get()
    # If authenticated credentials don't exist, open Browser to authenticate
    if credentials is None or credentials.invalid:
        credentials = tools.run_flow(flow, storage, flags)
    http = credentials.authorize(http=httplib2.Http())
    webmasters_service = build('searchconsole', 'v1', http=http)
    return webmasters_service
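Once you are authorized, an optional way to confirm that the login worked is to list the Search Console properties that your account can access (this check is not part of the final script):
webmasters_service = authorize_creds('client_secrets.json')
# List the properties available to the authorized account.
site_list = webmasters_service.sites().list().execute()
print(site_list)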
Step 5: Execute Request
We will now create the Python function that executes the request against the Google Search Console API.
def execute_request(service, property_uri, request):
    return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()
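Before running the full extraction, you could send a minimal test query to your property; the property URL below is the same placeholder used later in this post, and the date range is arbitrary.
# Hypothetical test call: clicks and impressions by date for a small date range.
test_request = {
    'startDate': '2023-01-01',
    'endDate': '2023-01-07',
    'dimensions': ['date'],
    'rowLimit': 10
}
response = execute_request(webmasters_service, 'https://www.yoursite.com', test_request)
print(response.get('rows', []))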
Step 6: Create Function to Process the CSV Files
Extracting Google Search Console (GSC) data day by day can take a long time if you have a lot of pages. What we want to do is save the data to a CSV file as soon as we have finished extracting a specific date.
We also want to check which dates have already been processed, so we don't extract the same information twice.
# Create function to write to CSV
def write_to_csv(data, filename):
    if not os.path.isfile(filename):
        data.to_csv(filename)
    else: # else it exists so append without writing the header
        data.to_csv(filename, mode='a', header=False)

# Read CSV if it exists to find dates that have already been processed.
def get_dates_from_csv(path):
    if os.path.isfile(path):
        data = pd.read_csv(path)
        data = pd.Series(data['date'].unique())
        return data
    else:
        pass
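To illustrate, assuming a previous run already produced a CSV file, get_dates_from_csv() returns the unique dates stored in it (and None if the file does not exist yet). write_to_csv() creates the file with a header on the first call and appends without the header afterwards, which is what lets the script resume safely.
# Hypothetical usage: see which dates are already in a previous export.
processed = get_dates_from_csv('www_example_com/gsc_data.csv')
if processed is not None:
    print(processed.tolist())   # e.g. ['2023-01-01', '2023-01-02']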
Step 7: Create a Function to Extract All the Data
# Helper to convert a date to the 'YYYY-MM-DD' string format expected by the API
def dt_to_str(date, fmt='%Y-%m-%d'):
    return str(date.strftime(fmt))

# Create function to extract all the data
def extract_data(site, creds, num_days, output):
    domain_name = get_domain_name(site)           # Get Domain From URL
    create_project(domain_name)                   # Create a new project folder
    full_path = domain_name + '/' + output        # Get path to add all files
    csv_dt = get_dates_from_csv(full_path)        # Read existing CSV
    webmasters_service = authorize_creds(creds)   # Get credentials to log in to the API
    # Set up Dates
    end_date = datetime.date.today() - \
        relativedelta.relativedelta(days=3)       # Start 3 days in the past, since GSC doesn't show the latest data.
    start_date = end_date - \
        relativedelta.relativedelta(days=num_days) # Get end date minus the number of days to go back in time.
    delta = datetime.timedelta(days=1)            # This will let us loop one day at a time
    scDict = defaultdict(list)                    # Initialize empty dict to store the data
    while start_date <= end_date:                 # Loop through all dates until start_date passes end_date.
        # If a GSC CSV file exists from a previous extraction
        # and the date we are extracting is already in the file...
        if csv_dt is not None and \
                csv_dt.str.contains(dt_to_str(start_date)).any():
            print('Existing Date: %s' % start_date) # ... print the date
            start_date += delta                     # ... and increment without extraction
        else: # If the file doesn't exist, or the date doesn't match...
            # ... print and start the extraction
            print('Start date at beginning: %s' % start_date)
            maxRows = 25000 # Maximum 25K rows per call
            numRows = 0     # Start at row zero
            status = ''     # Initialize status of extraction
            while (status != 'Finished'): # As long as this date's data has not been fully extracted.
                # Extract this information from GSC
                request = {
                    'startDate': dt_to_str(start_date),       # Current date of the loop
                    'endDate': dt_to_str(start_date),         # Current date of the loop
                    'dimensions': ['date', 'page', 'query'],  # Extract this information
                    'rowLimit': maxRows,                      # Set number of rows to extract at once (max 25k)
                    'startRow': numRows                       # Start at row 0, then row 25k, then row 50k... until all rows are extracted.
                }
                response = execute_request(webmasters_service, site, request)
                # Process the response
                try:
                    for row in response['rows']:
                        scDict['date'].append(row['keys'][0] or 0)
                        scDict['page'].append(row['keys'][1] or 0)
                        scDict['query'].append(row['keys'][2] or 0)
                        scDict['clicks'].append(row['clicks'] or 0)
                        scDict['ctr'].append(row['ctr'] or 0)
                        scDict['impressions'].append(row['impressions'] or 0)
                        scDict['position'].append(row['position'] or 0)
                    print('successful at %i' % numRows)
                except:
                    print('error occurred at %i' % numRows)
                # Add response to dataframe
                df = pd.DataFrame(data=scDict)
                df['clicks'] = df['clicks'].astype('int')
                df['ctr'] = df['ctr'] * 100
                df['impressions'] = df['impressions'].astype('int')
                df['position'] = df['position'].round(2)
                # Increment the 'startRow'
                print('Numrows at the start of loop: %i' % numRows)
                try:
                    numRows = numRows + len(response['rows'])
                except:
                    status = 'Finished' # If no rows are left, change the status
                print('Numrows at the end of loop: %i' % numRows)
                if numRows % maxRows != 0: # If numRows is not divisible by 25k...
                    status = 'Finished'    # ... change the status, you have covered all rows.
            start_date += delta # Increment start_date to continue the loop
            print('Start date at end: %s' % start_date)
            write_to_csv(df, full_path) # Write this date's data to the CSV file.
    df = pd.read_csv(full_path)
    return df
Step 8: Run the Extraction
site = 'https://www.yoursite.com'   # Property to extract
num_days = 5                        # Number of days to go back in time
creds = 'client_secrets.json'       # Credential file from GSC
output = 'gsc_data.csv'             # Name of the output CSV file
df = extract_data(site, creds, num_days, output)
df.sort_values('clicks', ascending=False)
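If you want a quick look at the result, or a separate sorted export, something like the following would also work (the output filename is just a suggestion):
# Preview the top queries by clicks and save a sorted copy of the export.
top = df.sort_values('clicks', ascending=False)
print(top.head(10))
top.to_csv('gsc_data_sorted_by_clicks.csv', index=False)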
Full Code
import pandas as pd
import datetime
from datetime import date, timedelta
import httplib2
from googleapiclient.discovery import build
from oauth2client.client import OAuth2WebServerFlow
from collections import defaultdict
from dateutil import relativedelta
import argparse
from oauth2client import client
from oauth2client import file
from oauth2client import tools
import re
import os
from urllib.parse import urlparse
site = 'https://www.yoursite.com' # Property to extract
num_days = 5 # Number of days to go back in time
creds = 'client_secrets.json' # Credential file from GSC
output = 'gsc_data.csv'
# Get Domain Name to Create a Project
def get_domain_name(start_url):
    domain_name = '{uri.netloc}'.format(uri=urlparse(start_url)) # Get Domain Name To Name Project
    domain_name = domain_name.replace('.', '_')
    return domain_name

# Create a project Directory for this website
def create_project(directory):
    if not os.path.exists(directory):
        print('Create project: ' + directory)
        os.makedirs(directory)
# Authorize credentials and build the Search Console service
def authorize_creds(creds):
    # Variable parameter that controls the set of resources that the access token permits.
    SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
    # Path to client_secrets.json file
    CLIENT_SECRETS_PATH = creds
    # Create a parser to be able to open browser for Authorization
    parser = argparse.ArgumentParser(
        formatter_class=argparse.RawDescriptionHelpFormatter,
        parents=[tools.argparser])
    flags = parser.parse_args([])
    flow = client.flow_from_clientsecrets(
        CLIENT_SECRETS_PATH, scope=SCOPES,
        message=tools.message_if_missing(CLIENT_SECRETS_PATH))
    # Prepare credentials and authorize HTTP
    # If they exist, get them from the storage object;
    # credentials will get written back to a file.
    storage = file.Storage('authorizedcreds.dat')
    credentials = storage.get()
    # If authenticated credentials don't exist, open Browser to authenticate
    if credentials is None or credentials.invalid:
        credentials = tools.run_flow(flow, storage, flags)
    http = credentials.authorize(http=httplib2.Http())
    webmasters_service = build('searchconsole', 'v1', http=http)
    return webmasters_service
# Create Function to execute your API Request
def execute_request(service, property_uri, request):
    return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()

# Create function to write to CSV
def write_to_csv(data, filename):
    if not os.path.isfile(filename):
        data.to_csv(filename)
    else: # else it exists so append without writing the header
        data.to_csv(filename, mode='a', header=False)

# Read CSV if it exists to find dates that have already been processed.
def get_dates_from_csv(path):
    if os.path.isfile(path):
        data = pd.read_csv(path)
        data = pd.Series(data['date'].unique())
        return data
    else:
        pass
# Create function to extract all the data
def extract_data(site, creds, num_days, output):
    domain_name = get_domain_name(site)
    create_project(domain_name)
    full_path = domain_name + '/' + output
    current_dates = get_dates_from_csv(full_path)
    webmasters_service = authorize_creds(creds)
    # Set up Dates
    end_date = datetime.date.today() - relativedelta.relativedelta(days=3)
    start_date = end_date - relativedelta.relativedelta(days=num_days)
    delta = datetime.timedelta(days=1) # This will let us loop one day at a time
    scDict = defaultdict(list)
    while start_date <= end_date:
        if current_dates is not None and \
                current_dates.str.contains(datetime.datetime.strftime(start_date, '%Y-%m-%d')).any():
            print('Existing Date: %s' % start_date)
            start_date += delta
        else:
            print('Start date at beginning: %s' % start_date)
            maxRows = 25000 # Maximum 25K rows per call
            numRows = 0     # Start at row zero
            status = ''     # Initialize status of extraction
            while (status != 'Finished'): # As long as this date's data has not been fully extracted.
                request = {
                    'startDate': datetime.datetime.strftime(start_date, '%Y-%m-%d'),
                    'endDate': datetime.datetime.strftime(start_date, '%Y-%m-%d'),
                    'dimensions': ['date', 'page', 'query'],
                    'rowLimit': maxRows,
                    'startRow': numRows
                }
                response = execute_request(webmasters_service, site, request)
                try:
                    # Process the response
                    for row in response['rows']:
                        scDict['date'].append(row['keys'][0] or 0)
                        scDict['page'].append(row['keys'][1] or 0)
                        scDict['query'].append(row['keys'][2] or 0)
                        scDict['clicks'].append(row['clicks'] or 0)
                        scDict['ctr'].append(row['ctr'] or 0)
                        scDict['impressions'].append(row['impressions'] or 0)
                        scDict['position'].append(row['position'] or 0)
                    print('successful at %i' % numRows)
                except:
                    print('error occurred at %i' % numRows)
                # Add response to dataframe
                df = pd.DataFrame(data=scDict)
                df['clicks'] = df['clicks'].astype('int')
                df['ctr'] = df['ctr'] * 100
                df['impressions'] = df['impressions'].astype('int')
                df['position'] = df['position'].round(2)
                print('Numrows at the start of loop: %i' % numRows)
                try:
                    numRows = numRows + len(response['rows'])
                except:
                    status = 'Finished'
                print('Numrows at the end of loop: %i' % numRows)
                if numRows % maxRows != 0:
                    status = 'Finished'
            start_date += delta
            print('Start date at end: %s' % start_date)
            write_to_csv(df, full_path)
    df = pd.read_csv(full_path)
    return df
df = extract_data(site,creds,num_days,output)
df.sort_values('clicks',ascending=False)
You can run the task daily using Windows Task Scheduler and save the data to a CSV, or a MySQL database.
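If you prefer a database over an ever-growing CSV file, here is a minimal sketch using pandas and SQLAlchemy. The connection string, table name and credentials are placeholders, and you would need to install sqlalchemy plus a MySQL driver such as pymysql.
from sqlalchemy import create_engine

# Hypothetical connection string: adjust user, password, host and database name.
engine = create_engine('mysql+pymysql://user:password@localhost/seo_data')

# Append the extracted rows to a table instead of (or in addition to) the CSV.
df.to_sql('gsc_data', con=engine, if_exists='append', index=False)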
Automate Your Python Script
To automate the Google Search Console API queries, schedule your Python script with Windows Task Scheduler, or automate the script using cron on Mac or Linux.
Google Search Console FAQs
What is Google Search Console?
What is the Google Search Console API?
How do you get a Google Search Console API key?