GSC API: Get All Your Data with Python

Note: the code in this post has since been updated and the version below might not work correctly. The updated code is available on GitHub. Watch this video to see how to use the Google Search Console API.

If you wonder how you can extract all your keywords from Google Search Console, you’re at the right place.

This post will show you how to automatically extract all your search traffic and all your keywords from Google Search Console (formerly Google Webmaster Tools) using Python and the Google Search Console API.

(Step-by-step guide with code sample)


    This is the perfect solution for SEOs working on large websites with a lot of pages and a lot of keywords.

    That’s right, you will learn how you can extract more than 25,000 keywords and automatically export the data to a CSV file.

    With this script, I managed to extract over 500K rows of data each month from the GSC API.

    If you want a more complex, but more powerful, solution to extract all your Google Search Console data, check out this introduction to the GSC API video tutorial.

    To learn more about Python, view my complete guide on Python for SEO.

    This post is part of the guide on the Google Search Console API

    Limits of Google Search Console

    GSC is a free tool offered by Google, so it is normal that it doesn’t show all of the data to everyone; that would be a very expensive process. Instead, Google offers an API.

    With GSC, you can’t get a report of top-performing keywords by page, by country, or by device.

    The Google Search Console performance report also limits you to 1,000 rows of data, which can be annoying.

    The GSC API lets you create keywords-by-page reports and lets you query every page and every keyword that can be extracted from GSC.
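
    To give you an idea of what this looks like, here is a minimal sketch of the kind of request body we will send to the API later in this post (the dates and values are placeholders):

    # Minimal sketch of a Search Analytics request body (placeholder dates).
    # The same structure is used in the full script further down.
    request = {
        'startDate': '2020-01-01',        # placeholder start date
        'endDate': '2020-01-31',          # placeholder end date
        'dimensions': ['page', 'query'],  # keywords by page
        'rowLimit': 25000,                # up to 25,000 rows per call
        'startRow': 0                     # paginate by incrementing this value
    }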

    Extract All Your Keywords Data From Google Search Console

    For those not used to working with APIs, getting started can be a hurdle, but you should do it anyway: Google APIs are awesome for SEOs.

    Fear not, I will walk you through it to help you get started.

    You can learn everything from this post, but if you feel that you need an easier starter guide, just read my post “How to Use Google Search Console API With Python”.

    Here are the steps to extract all your search traffic data from Google Search Console.

    1. Get Your Google Search Console API Key
    2. Import Libraries
    3. Create a Directory for Your Project
    4. Log in to the API
    5. Create a Function to Execute the Request
    6. Create Functions to Process the CSV Files
    7. Create the Function to Extract All the Data
    8. Run the Extraction

    Step 1: Get Your Google Search Console API Key

    First, you need to get an API key to connect to the Google Search Console API.

    What is an API Key?

    The API key is kind of like the username and password that you need to provide to access the API.

    How do you get the GSC API Keys?

    Follow this guide for the detailed steps to get your Google Search Console API key.

    Otherwise, here are the simplified steps to get your Google Search Console API keys.

    1. Go to Google’s developers console and sign in;
    2. Go to “Dashboard” and click “Enable APIs and Services”;
    3. Search for “Google Search Console API” and enable the API;
    4. Go to the “Credentials” tab, click on “Create credentials” and select “OAuth client ID”;
    5. Click on “Configure the consent screen” and give a name to your product;
    6. Choose “Other” as the application type and click “Create”;
    7. Copy the client ID and client secret, or go on and save to download the JSON file (you can sanity-check the file as shown below).
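
    If you downloaded the JSON file, you can do a quick sanity check on it with a few lines of Python. This is only a sketch, assuming the file was saved as client_secrets.json in your working folder; depending on the application type, the top-level key is usually “installed” or “web”:

    import json

    # Quick sanity check of the downloaded credentials file (assumes it was
    # saved as client_secrets.json in your working folder).
    with open('client_secrets.json') as f:
        secrets = json.load(f)

    # The top-level key is usually 'installed' or 'web', depending on the app type.
    key = 'installed' if 'installed' in secrets else 'web'
    print(secrets[key]['client_id'])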

    Step 2: Import Libraries

    Here are the libraries that we will need to run this code:

    • pandas, datetime, collections, httplib2, googleapiclient, oauth2client, argparse, re, os, urllib and dateutil
    import pandas as pd
    import datetime
    from datetime import date, timedelta
    import httplib2
    from googleapiclient.discovery import build
    from oauth2client.client import OAuth2WebServerFlow
    from collections import defaultdict
    from dateutil import relativedelta
    import argparse
    from oauth2client import client
    from oauth2client import file
    from oauth2client import tools
    import re
    import os
    from urllib.parse import urlparse
    

    Step 3: Create Directory for Your Project

    The goal here is to use the property URL that you have added to create a directory for your project, where the CSV data will be stored. These two functions extract the domain name and create the project folder from it.

    # Get Domain Name to Create a Project
    def get_domain_name(start_url):
        domain_name = '{uri.netloc}'.format(uri=urlparse(start_url))  # Get Domain Name To Name Project
        domain_name = domain_name.replace('.','_')
        return domain_name
    
    
    # Create a project Directory for this website
    def create_project(directory):
        if not os.path.exists(directory):
            print('Create project: '+ directory)
            os.makedirs(directory)
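
    As a quick illustration (the URL is just an example), here is what these two helpers do:

    # Example usage (illustrative URL).
    print(get_domain_name('https://www.example.com'))   # 'www_example_com'
    create_project('www_example_com')                    # creates the folder if it doesn't exist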
    

    Step 4: Log in to the API

    Now, to log in to the API, you will need to download your JSON API key from Step 1 and save it as client_secrets.json in your working folder.

    def authorize_creds(creds):
        # Variable parameter that controls the set of resources that the access token permits.
        SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly'] 
    
        # Path to client_secrets.json file
        CLIENT_SECRETS_PATH = creds
    
        # Create a parser to be able to open browser for Authorization
        parser = argparse.ArgumentParser(
            formatter_class=argparse.RawDescriptionHelpFormatter,
            parents=[tools.argparser])
        flags = parser.parse_args([])
    
        flow = client.flow_from_clientsecrets(
            CLIENT_SECRETS_PATH, scope = SCOPES,
            message = tools.message_if_missing(CLIENT_SECRETS_PATH))
    
        # Prepare credentials and authorize HTTP
        # If they exist, get them from the storage object
        # credentials will get written back to a file.
        storage = file.Storage('authorizedcreds.dat')
        credentials = storage.get()
    
        # If authenticated credentials don't exist, open Browser to authenticate
        if credentials is None or credentials.invalid:
            credentials = tools.run_flow(flow, storage, flags)
        http = credentials.authorize(http=httplib2.Http())
        webmasters_service = build('searchconsole', 'v1', http=http)
        return webmasters_service
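
    The first time you call the function, a browser window opens so you can authorize access; the resulting token is cached in authorizedcreds.dat, so later runs won’t ask again. For example:

    # First run opens the browser for authorization;
    # later runs reuse the token cached in authorizedcreds.dat.
    webmasters_service = authorize_creds('client_secrets.json')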
    

    Step 5: Create a Function to Execute the Request

    We will now create the Python function that executes the request against the Google Search Console API.

    def execute_request(service, property_uri, request): 
        return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()
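
    The response comes back as a Python dictionary with a 'rows' list. Roughly, it looks like this sketch (the values are made up for illustration):

    # Rough shape of the API response (illustrative values only).
    response = {
        'rows': [
            {
                'keys': ['2020-01-01', 'https://www.example.com/page', 'example query'],
                'clicks': 12,
                'impressions': 340,
                'ctr': 0.035,
                'position': 8.4
            }
        ]
    }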
    

    Step 6: Create Functions to Process the CSV Files

    Extracting Google Search Console (GSC) data date by date can take a long time if you have a lot of pages. What we want to do is save the data to a CSV file every time we have finished extracting a specific date.

    We will also want to check which dates have already been processed, so we don’t extract the same information twice.

    # Create function to write to CSV
    def write_to_csv(data,filename):
        if not os.path.isfile(filename):
            data.to_csv(filename)
        else: # else it exists so append without writing the header
            data.to_csv(filename, mode='a', header=False)
    
    # Read CSV if it exists to find dates that have already been processed.
    def get_dates_from_csv(path):
        if os.path.isfile(path):
            data = pd.read_csv(path)
            data = pd.Series(data['date'].unique())
            return data
        else:
            pass
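
    Here is a quick, self-contained illustration of how these helpers behave (the file name and data are examples):

    # Illustrative usage of the CSV helpers (file name and data are examples).
    sample = pd.DataFrame({'date': ['2020-01-01'], 'query': ['example query'], 'clicks': [3]})
    write_to_csv(sample, 'gsc_data_sample.csv')       # creates the file with a header
    write_to_csv(sample, 'gsc_data_sample.csv')       # appends without rewriting the header
    print(get_dates_from_csv('gsc_data_sample.csv'))  # Series containing '2020-01-01'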
    

    Step 7: Create the Function to Extract All the Data

    This function ties everything together: it loops through the dates one day at a time, queries the API in batches of up to 25,000 rows, and appends each day’s results to the CSV file.

    # Helper to convert a date to the YYYY-MM-DD string expected by the API
    def dt_to_str(date, fmt='%Y-%m-%d'):
        return date.strftime(fmt)

    # Create function to extract all the data
    def extract_data(site,creds,num_days,output):
        domain_name = get_domain_name(site)             # Get Domain From URL
        create_project(domain_name)                     # Create a new project folder
        full_path = domain_name + '/' + output          # Get Path to add all files
        csv_dt = get_dates_from_csv(full_path)          # Read existing CSV
        webmasters_service = authorize_creds(creds)     # Get credentials to log in the api
    
        # Set up Dates
        end_date = datetime.date.today() -\
             relativedelta.relativedelta(days=3)        # Start 3 days in the past, since GSC doesn't show the latest data.
        start_date = end_date -\
             relativedelta.relativedelta(days=num_days) # Get end date minus the number of days to go back in time.
        delta = datetime.timedelta(days=1)              # This will let us loop one day at a time
    
        while start_date <= end_date:                   # Loop through all dates until start_date passes end_date.
            
            # If a GSC csv file exists from previous extraction
            # and dates in the file match to dates we are extracting...
            if csv_dt is not None and \
                csv_dt.str.contains(dt_to_str(start_date)).any(): 
                print('Existing Date: %s' % start_date) #... Print the date
                start_date += delta                     #... and increment without extraction  
            else:                                       # If the file doesn't exist, or the date doesn't match...
                # ... Print and start the extraction
                print('Start date at beginning: %s' % start_date)

                scDict = defaultdict(list)              # Reset the dict for each date so rows aren't written to the CSV twice
                maxRows = 25000 # Maximum 25K rows per call
                numRows = 0     # Start at Row Zero
                status = ''     # Initialize status of extraction
    
    
                while (status != 'Finished') : # As long as this date's data has not been fully extracted.
                    # Extract this information from GSC
                    request = {
                        'startDate': dt_to_str(start_date),     # Date of the current loop iteration
                        'endDate': dt_to_str(start_date),       # Same date: we extract one day at a time
                        'dimensions': ['date','page','query'],  # Extract this information
                        'rowLimit': maxRows,                    # Set number of rows to extract at once (max 25k)
                        'startRow': numRows                     # Start at row 0, then row 25k, then row 50k... until all rows are extracted.
                    }
    
                    response = execute_request(webmasters_service, site, request)
                    
                    #Process the response
                    try:
                        for row in response['rows']:
                            scDict['date'].append(row['keys'][0] or 0)    
                            scDict['page'].append(row['keys'][1] or 0)
                            scDict['query'].append(row['keys'][2] or 0)
                            scDict['clicks'].append(row['clicks'] or 0)
                            scDict['ctr'].append(row['ctr'] or 0)
                            scDict['impressions'].append(row['impressions'] or 0)
                            scDict['position'].append(row['position'] or 0)
                        print('successful at %i' % numRows)
    
                    except:
                        print('error occurred at %i' % numRows)
    
                    # Add response to dataframe
                    df = pd.DataFrame(data = scDict)
                    if not df.empty:                    # Skip the conversions if the date returned no rows
                        df['clicks'] = df['clicks'].astype('int')
                        df['ctr'] = df['ctr']*100
                        df['impressions'] = df['impressions'].astype('int')
                        df['position'] = df['position'].round(2)
    
                    # Increment the 'start_row'
                    print('Numrows at the start of loop: %i' % numRows)
                    try: 
                        numRows = numRows + len(response['rows'])
                    except:
                        status = 'Finished'                 # If no response left, change status
                    print('Numrows at the end of loop: %i' % numRows)
                    if numRows % maxRows != 0:              # If numRows not divisible by 25k...
                        status = 'Finished'                 # change status, you have covered all lines.                
            
                start_date += delta                         # Increment start_date to continue the loop
                print('Start date at end: %s' % start_date) 
                write_to_csv(df,full_path)                  # Write today's data to CSV file.
        df = pd.read_csv(full_path)                     # Read back the full CSV once every date is processed
        return df
    

    Step 8: Run the Extraction

    Finally, set your property, the number of days to go back, and the path to your credentials file, then run the extraction.

    site = 'https://www.yoursite.com'    # Property to extract
    num_days = 5                         # Number of days to extract (going back in time)
    creds = 'client_secrets.json'        # Credentials file downloaded in Step 1
    output = 'gsc_data.csv'
    df = extract_data(site,creds,num_days,output)
    df.sort_values('clicks',ascending=False)
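
    Once the DataFrame is returned, you can slice it however you like. Note that df.sort_values() returns a sorted copy, so assign the result if you want to keep the order. For example, this is how you could build the keywords-by-page report mentioned earlier (the URL is illustrative):

    # Example: top keywords for a single page, sorted by clicks (URL is illustrative).
    page_df = df[df['page'] == 'https://www.yoursite.com/some-page/']
    print(page_df.sort_values('clicks', ascending=False)[['query','clicks','impressions']].head(10))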
    

    Full Code

    import pandas as pd
    import datetime
    from datetime import date, timedelta
    import httplib2
    from googleapiclient.discovery import build
    from oauth2client.client import OAuth2WebServerFlow
    from collections import defaultdict
    from dateutil import relativedelta
    import argparse
    from oauth2client import client
    from oauth2client import file
    from oauth2client import tools
    import re
    import os
    from urllib.parse import urlparse
    
    site = 'https://www.yoursite.com'    # Property to extract
    num_days = 5                    # Number of days to extract (going back in time)
    creds = 'client_secrets.json'   # Credentials file downloaded in Step 1
    output = 'gsc_data.csv'
    
    # Get Domain Name to Create a Project
    def get_domain_name(start_url):
        domain_name = '{uri.netloc}'.format(uri=urlparse(start_url))  # Get Domain Name To Name Project
        domain_name = domain_name.replace('.','_')
        return domain_name
    
    
    # Create a project Directory for this website
    def create_project(directory):
        if not os.path.exists(directory):
            print('Create project: '+ directory)
            os.makedirs(directory)
    
    def authorize_creds(creds):
        # Variable parameter that controls the set of resources that the access token permits.
        SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly'] 
    
        # Path to client_secrets.json file
        CLIENT_SECRETS_PATH = creds
    
        # Create a parser to be able to open browser for Authorization
        parser = argparse.ArgumentParser(
            formatter_class=argparse.RawDescriptionHelpFormatter,
            parents=[tools.argparser])
        flags = parser.parse_args([])
    
        flow = client.flow_from_clientsecrets(
            CLIENT_SECRETS_PATH, scope = SCOPES,
            message = tools.message_if_missing(CLIENT_SECRETS_PATH))
    
        # Prepare credentials and authorize HTTP
        # If they exist, get them from the storage object
        # credentials will get written back to a file.
        storage = file.Storage('authorizedcreds.dat')
        credentials = storage.get()
    
        # If authenticated credentials don't exist, open Browser to authenticate
        if credentials is None or credentials.invalid:
            credentials = tools.run_flow(flow, storage, flags)
        http = credentials.authorize(http=httplib2.Http())
        webmasters_service = build('searchconsole', 'v1', http=http)
        return webmasters_service
    
    # Create Function to execute your API Request
    def execute_request(service, property_uri, request):
        return service.searchanalytics().query(siteUrl=property_uri, body=request).execute()
    
    # Create function to write to CSV
    def write_to_csv(data,filename):
        if not os.path.isfile(filename):
            data.to_csv(filename)
        else: # else it exists so append without writing the header
            data.to_csv(filename, mode='a', header=False)
    
    # Read CSV if it exists to find dates that have already been processed.
    def get_dates_from_csv(path):
        if os.path.isfile(path):
            data = pd.read_csv(path)
            data = pd.Series(data['date'].unique())
            return data
        else:
            pass
    
       
    # Create function to extract all the data
    def extract_data(site,creds,num_days,output):
        domain_name = get_domain_name(site)
        create_project(domain_name)
        full_path = domain_name + '/' + output
        current_dates = get_dates_from_csv(full_path)
    
        webmasters_service = authorize_creds(creds)
    
        # Set up Dates
        end_date = datetime.date.today() - relativedelta.relativedelta(days=3)
        start_date = end_date - relativedelta.relativedelta(days=num_days)
        delta = datetime.timedelta(days=1) # This will let us loop one day at a time
    
        while start_date <= end_date:
            if current_dates is not None and current_dates.str.contains(datetime.datetime.strftime(start_date,'%Y-%m-%d')).any():
                print('Existing Date: %s' % start_date)
                start_date += delta     
            else:
                print('Start date at beginning: %s' % start_date)

                scDict = defaultdict(list)  # Reset the dict for each date so rows aren't written to the CSV twice
                maxRows = 25000 # Maximum 25K rows per call
                numRows = 0     # Start at Row Zero
                status = ''     # Initialize status of extraction
    
    
                while (status != 'Finished') : # As long as this date's data has not been fully extracted.
                    request = {
                        'startDate': datetime.datetime.strftime(start_date,'%Y-%m-%d'),
                        'endDate': datetime.datetime.strftime(start_date,'%Y-%m-%d'),
                        'dimensions': ['date','page','query'],
                        'rowLimit': maxRows, 
                        'startRow': numRows
                    }
    
                    response = execute_request(webmasters_service, site, request)
    
                    try:
                    #Process the response
                        for row in response['rows']:
                            scDict['date'].append(row['keys'][0] or 0)    
                            scDict['page'].append(row['keys'][1] or 0)
                            scDict['query'].append(row['keys'][2] or 0)
                            scDict['clicks'].append(row['clicks'] or 0)
                            scDict['ctr'].append(row['ctr'] or 0)
                            scDict['impressions'].append(row['impressions'] or 0)
                            scDict['position'].append(row['position'] or 0)
                        print('successful at %i' % numRows)
    
                    except:
                        print('error occurred at %i' % numRows)
    
                    #Add response to dataframe
                    df = pd.DataFrame(data = scDict)
                    if not df.empty:                    # Skip the conversions if the date returned no rows
                        df['clicks'] = df['clicks'].astype('int')
                        df['ctr'] = df['ctr']*100
                        df['impressions'] = df['impressions'].astype('int')
                        df['position'] = df['position'].round(2)
    
                    print('Numrows at the start of loop: %i' % numRows)
                    try: 
                        numRows = numRows + len(response['rows'])
                    except:
                        status = 'Finished'
                    print('Numrows at the end of loop: %i' % numRows)
                    if numRows % maxRows != 0:
                        status = 'Finished'
            
                start_date += delta  
                print('Start date at end: %s' % start_date) 
                write_to_csv(df,full_path)
        df = pd.read_csv(full_path)     # Read back the full CSV once every date is processed
        return df
    
    df = extract_data(site,creds,num_days,output)
    df.sort_values('clicks',ascending=False)
    

    You can run the task daily using Windows Task Scheduler and save the data to a CSV file or a MySQL database.

    Automate Your Python Script

    To automate the Google Search Console API queries, schedule your Python script with Windows Task Scheduler, or automate it using cron on Mac or Linux.
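
    For example, on Mac or Linux, a crontab entry along these lines would run the script every morning at 7am (the script name and paths are illustrative; adjust them to your setup):

    # Illustrative crontab entry: run the extraction every day at 7:00 am.
    0 7 * * * /usr/bin/python3 /path/to/gsc_extractor.py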


    Google Search Console FAQs

    What is Google Search Console?

    Google Search Console (GSC), formerly Google Webmaster Tools (GWT), is the platform that Google offers to help monitor a website’s search performance. It provides tools that let webmasters test whether a page is indexed and how it is rendered, along with traffic metrics like clicks, impressions, click-through rate and position in Google.

    What Is the Google Search Console API?

    The Google Search Console API is a free service that lets you extract data from the search performance report. You can then store it and use it however you please.

    How to Get a Google Search Console API Key?

    You can connect to the API using API keys from the Google developers console. Follow this guide for the detailed steps to get your Google Search Console API key.
