How to Use Causal Impact in Python (pyCausalImpact) With Examples

Evaluate the results of an SEO experiment on your site using Google Search Console and CausalImpact with Python.

In this tutorial, we will learn two ways to run CausalImpact on Google Search Console data using the pycausalimpact Python package.

CausalImpact is a package created by Kay H. Brodersen at Google that uses Bayesian structural time-series models to infer the causal effect of an intervention.


How to format your data for CausalImpact

CausalImpact can be used in two ways:


    1. Simple pre-post experiment
    2. Using control groups

    Simple pre-post experiment

    Your dataset should be a table with dates and a single y column.

    • A date column
    • A y column containing the metric from a single (test) site

    With Google Search Console, the y column could contain clicks, impressions, CTR or position.
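A minimal pre-post dataset could be built like this (the numbers below are invented, purely for illustration):

```python
import pandas as pd

# Made-up daily clicks for a single site, in the shape CausalImpact
# expects for a simple pre-post experiment: one date column, one y column
data = pd.DataFrame({
    'date': pd.date_range('2021-07-25', periods=6, freq='D'),
    'y': [120, 135, 128, 150, 310, 295],  # e.g. daily clicks
})
print(data)
```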

    Using control groups

    When comparing against control groups, your dataset should be a table like this:

    • dates
    • y column containing data from the sites which you tested on
    • Xn columns containing data from the sites/subfolders on which the test was not run. Each X column represents a different control feature.

    In Google Search Console, y could contain clicks from Site A, the X1 column could contain clicks from Site B and X2, clicks from Site C.
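A hypothetical control-group dataset might look like this (values are invented; y is the test site, X1 and X2 are the controls):

```python
import pandas as pd

# Made-up example: y holds clicks from the test site (Site A),
# X1 and X2 hold clicks from two control sites (Site B, Site C)
data = pd.DataFrame({
    'date': pd.date_range('2021-07-25', periods=5, freq='D'),
    'y':  [120, 135, 128, 150, 310],
    'X1': [80, 90, 85, 95, 88],
    'X2': [200, 210, 190, 220, 205],
})
print(data)
```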


    Defining test and control groups

    Control groups in causal inference are datasets that are not impacted by the experiment and that can be used to improve the prediction on your test data.

    Control groups can be different things:

    1. Independent data such as search trends of a topic in Google Trends
    2. Different domains (example1.com, example2.com)
    3. Different ccTLD (example.com, example.ca)
    4. Different subdomains (ca.example.com, au.example.com)
    5. Different subfolders (example.com/ca, example.com/au)

    By convention, the test group is usually labelled as y and the control groups as X.


    Getting Started

    First, you will need to install Python along with a few packages.

    $ pip install pycausalimpact
    $ pip install searchconsole
    

    Run Causal Impact with Python on Extracted GSC data

    The simplest way to load Google Search Console data is through a simple export in the performance report.

    Load Search Console data

    And then load the data with pandas and define your parameters.

    import pandas as pd 
    
    X = pd.read_csv('control_gsc_data.csv')
    y = pd.read_csv('test_gsc_data.csv')
    
    # define metric that you want to test
    # impressions, clicks, ctr
    metric = 'clicks' 
    
    # define intervention data
    intervention = '2021-08-01'
    

    Execute Causal Impact

    from causalimpact import CausalImpact
    
    def get_pre_post(data):
        """Get pre-post periods based on the intervention date
    
        Args:
            data (dataframe): df comming from create_master_df()
    
        Returns:
            tuple: tuple of lists showing index edges of period before and after intervention
        """        
        pre_start = min(data.index)
        pre_end = int(data[data['date'] == intervention].index.values)
        post_start = pre_end + 1
        post_end = max(data.index)
    
        pre_period = [pre_start, pre_end] 
        post_period = [post_start, post_end]
        return pre_period, post_period
    
    
    def make_report(data, pre_period, post_period):
        """Creates the built-in CausalImpact report
    
        Args:
            data (dataframe): merged dataframe with date, y and X columns
            pre_period (list): list coming from get_pre_post()
            post_period (list): list coming from get_pre_post()
        """        
        ci = CausalImpact(data.drop(['date'], axis=1), pre_period, post_period)
        print(ci.summary())
        print(ci.summary(output='report'))
        ci.plot()
    
    
    if __name__ == '__main__':
        y = y[['date', metric]].rename(columns={metric:'y'})
        X = X[['date', metric]].rename(columns={metric:'X'})
        data = y.merge(X, on='date', how='left')
        data = data.sort_values(by='date').reset_index(drop=True)
        pre_period, post_period = get_pre_post(data)
        make_report(data, pre_period, post_period)
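
Note that CausalImpact is given integer row indices for the pre and post periods, not dates. As a quick illustration of how get_pre_post() maps an intervention date to index boundaries (toy data, invented numbers):

```python
import pandas as pd

intervention = '2021-08-01'

# Toy dataframe: five days of data, intervention on the third day
data = pd.DataFrame({
    'date': ['2021-07-30', '2021-07-31', '2021-08-01',
             '2021-08-02', '2021-08-03'],
    'y': [100, 110, 105, 180, 175],
})

# Same logic as get_pre_post(): the pre period ends on the
# intervention date, the post period starts the row after it
pre_end = int(data[data['date'] == intervention].index[0])
pre_period = [int(min(data.index)), pre_end]
post_period = [pre_end + 1, int(max(data.index))]
print(pre_period, post_period)  # [0, 2] [3, 4]
```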
    

    Run Causal Impact with GSC API

    Get your credentials file

    To use the Google Search Console API, you will need to get your Google Search Console credentials and save them into a client_secrets.json file.

    Authenticate

    Then, you will need to authenticate.

    import os
    import searchconsole
    
    def authenticate(config='client_secrets.json', token='credentials.json'):
        """Authenticate GSC"""
        if os.path.isfile(token):
            account = searchconsole.authenticate(client_config=config,
                                                credentials=token)
        else:
            account = searchconsole.authenticate(client_config=config,
                                            serialize=token)
        return account
    
    account = authenticate()
    

    At this stage, your browser will open and ask you to authenticate the API.

    Create the class to process data

    from causalimpact import CausalImpact
    from functools import reduce
    import numpy as np
    import pandas as pd 
    import searchconsole
    
    
    class Causal:
    
    
        def __init__(self, account, intervention, test_sites, control_sites=None, months=-16, metric='clicks', dimension='date'):
            self.account = account
            self.test_sites = test_sites
            self.control_sites = control_sites if control_sites else None
            self.intervention = intervention
            self.metric = metric
            self.months = months
            self.dimension = dimension
    
    
        def run_causal(self):
            """Combines all the functions together
    
            Returns:
                [df]: dataframe on which CI was run
            """        
            data = self.create_master_df()
            pre_period, post_period = self.get_pre_post(data)
            self.make_report(data, pre_period, post_period)
            return data
    
    
        def extract_from_list(self, sites):
            """Extract GSC data from a list of sites
    
            Args:
                sites (list): list of properties validated in GSC
    
            Returns:
                [list]: List of dataframes extracted from GSC
            """        
            print(f'Extracting data for {sites}')
            dfs = []
            for site in sites:
                print(f'Extracting: {site}')
                webproperty = self.account[site]
                report = webproperty.query\
                        .range('today', months=self.months)\
                        .dimension(self.dimension)\
                        .get()
                df = report.to_dataframe()
                dfs.append(df)
            return dfs
    
    
        def concat_test(self, dfs):
            """Concatenate the dataframes used for testing
    
            Args:
                dfs (list): List of dataframes extracted from GSC
    
            Returns:
                dataframe: merged test dataframes summed together 
            """        
            concat_df = pd.concat(dfs)
            test = concat_df.groupby('date')[['clicks', 'impressions']].sum()
            test = test.reset_index()
            test['ctr'] = test['clicks'] / test['impressions']
            return test
    
    
        def concat_control(self, dfs):
            """Concatenate the dataframes used for control
    
            Args:
                dfs (list): List of dataframes extracted from GSC
    
            Returns:
                dataframe: merged control dataframes. 1 metric column by df 
            """        
            control_data = []
            for i in range(len(dfs)):
                df = dfs[i][['date', self.metric]]
                df = df.rename(columns={self.metric: f'X{i}'})
                control_data.append(df)
            control = reduce(
                    lambda left, right: pd.merge(
                            left, right, on=['date'],
                            how='outer'),
                    control_data
                    )
            return control
    
    
        def create_master_df(self):
            """Create a master df for a given metric with:
            y = test (target)
            Xn = control (features)
    
            Returns:
                dataframe: df with target and features based on list of sites
            """        
            test = self.extract_from_list(self.test_sites)
            test = self.concat_test(test)
            y = test[['date', self.metric]].rename(columns={self.metric:'y'})
    
            if self.control_sites:
                control = self.extract_from_list(self.control_sites)
                X = self.concat_control(control)
                data = y.merge(X, on='date', how='left')
            else:
                data = y
            return data.sort_values(by='date').reset_index(drop=True)
    
    
        def get_pre_post(self, data):
            """Get pre-post periods based on the intervention date
    
            Args:
                data (dataframe): df coming from create_master_df()
    
            Returns:
                tuple: tuple of lists showing index edges of period before and after intervention
            """        
            pre_start = min(data.index)
            pre_end = data[data['date'] == self.intervention].index[0]
            post_start = pre_end + 1
            post_end = max(data.index)
    
            pre_period = [pre_start, pre_end] 
            post_period = [post_start, post_end]
            return pre_period, post_period
    
    
        def make_report(self, data, pre_period, post_period):
            """Creates the built-in CausalImpact report
    
            Args:
                data (dataframe): df coming from create_master_df()
                pre_period (list): list coming from get_pre_post()
                post_period (list): list coming from get_pre_post()
            """        
            ci = CausalImpact(data.drop(['date'], axis=1), pre_period, post_period)
            print(ci.summary())
            print(ci.summary(output='report'))
            ci.plot()
    
    

    Define the variables

    # define metric that you want to test
    # impressions, clicks, ctr
    metric = 'clicks' 
    
    # define intervention data
    intervention = '2021-08-01'
    
    # give the path of your credential file
    client_secrets = 'client_secrets.json'
    
    # define sites on which you ran the experiment (required)
    test_sites = [
        'https://ca.example.com/', 
        'https://us.example.com/', 
        'https://au.example.com/'
        ]
    
    # define control sites that were not shown the experiment (optional)
    # set list as empty [] to run simple pre-post experiment
    control_sites = [
        'https://www.example.fr/',
        'https://uk.example.com/'
        ]
    

    Run the code

    if __name__ == '__main__':
        account = authenticate(config=client_secrets)
        c = Causal(
            account,
            intervention,
            test_sites,
            control_sites=control_sites, 
            metric='clicks')
        c.run_causal()
    

    Full code

    from causalimpact import CausalImpact
    from functools import reduce
    import numpy as np
    import os
    import pandas as pd 
    import searchconsole
    
    # define metric that you want to test
    # impressions, clicks, ctr
    metric = 'clicks' 
    
    # define intervention data
    intervention = '2021-08-01'
    
    # give the path of your credential file
    client_secrets = 'client_secrets.json'
    
    # define sites on which you ran the experiment (required)
    test_sites = [
        'https://ca.example.com/', 
        'https://us.example.com/', 
        'https://au.example.com/'
        ]
    
    # define control sites that were not shown the experiment (optional)
    # set list as empty [] to run simple pre-post experiment
    control_sites = [
        'https://www.example.fr/',
        'https://uk.example.com/'
        ]
    
    
    class Causal:
    
    
        def __init__(self, account, intervention, test_sites, control_sites=None, months=-16, metric='clicks', dimension='date'):
            self.account = account
            self.test_sites = test_sites
            self.control_sites = control_sites if control_sites else None
            self.intervention = intervention
            self.metric = metric
            self.months = months
            self.dimension = dimension
    
    
        def run_causal(self):
            """Combines all the functions together
    
            Returns:
                [df]: dataframe on which CI was run
            """        
            data = self.create_master_df()
            pre_period, post_period = self.get_pre_post(data)
            self.make_report(data, pre_period, post_period)
            return data
    
    
        def extract_from_list(self, sites):
            """Extract GSC data from a list of sites
    
            Args:
                sites (list): list of properties validated in GSC
    
            Returns:
                [list]: List of dataframes extracted from GSC
            """        
            print(f'Extracting data for {sites}')
            dfs = []
            for site in sites:
                print(f'Extracting: {site}')
                webproperty = self.account[site]
                report = webproperty.query\
                        .range('today', months=self.months)\
                        .dimension(self.dimension)\
                        .get()
                df = report.to_dataframe()
                dfs.append(df)
            return dfs
    
    
        def concat_test(self, dfs):
            """Concatenate the dataframes used for testing
    
            Args:
                dfs (list): List of dataframes extracted from GSC
    
            Returns:
                dataframe: merged test dataframes summed together 
            """        
            concat_df = pd.concat(dfs)
            test = concat_df.groupby('date')[['clicks', 'impressions']].sum()
            test = test.reset_index()
            test['ctr'] = test['clicks'] / test['impressions']
            return test
    
    
        def concat_control(self, dfs):
            """Concatenate the dataframes used for control
    
            Args:
                dfs (list): List of dataframes extracted from GSC
    
            Returns:
                dataframe: merged control dataframes. 1 metric column by df 
            """        
            control_data = []
            for i in range(len(dfs)):
                df = dfs[i][['date', self.metric]]
                df = df.rename(columns={self.metric: f'X{i}'})
                control_data.append(df)
            control = reduce(
                    lambda left, right: pd.merge(
                            left, right, on=['date'],
                            how='outer'),
                    control_data
                    )
            return control
    
    
        def create_master_df(self):
            """Create a master df for a given metric with:
            y = test (target)
            Xn = control (features)
    
            Returns:
                dataframe: df with target and features based on list of sites
            """        
            test = self.extract_from_list(self.test_sites)
            test = self.concat_test(test)
            y = test[['date', self.metric]].rename(columns={self.metric:'y'})
    
            if self.control_sites:
                control = self.extract_from_list(self.control_sites)
                X = self.concat_control(control)
                data = y.merge(X, on='date', how='left')
            else:
                data = y
            return data.sort_values(by='date').reset_index(drop=True)
    
    
        def get_pre_post(self, data):
            """Get pre-post periods based on the intervention date
    
            Args:
                data (dataframe): df coming from create_master_df()
    
            Returns:
                tuple: tuple of lists showing index edges of period before and after intervention
            """        
            pre_start = min(data.index)
            pre_end = data[data['date'] == self.intervention].index[0]
            post_start = pre_end + 1
            post_end = max(data.index)
    
            pre_period = [pre_start, pre_end] 
            post_period = [post_start, post_end]
            return pre_period, post_period
    
    
        def make_report(self, data, pre_period, post_period):
            """Creates the built-in CausalImpact report
    
            Args:
                data (dataframe): df coming from create_master_df()
                pre_period (list): list coming from get_pre_post()
                post_period (list): list coming from get_pre_post()
            """        
            ci = CausalImpact(data.drop(['date'], axis=1), pre_period, post_period)
            print(ci.summary())
            print(ci.summary(output='report'))
            ci.plot()
    
    
    def authenticate(config='client_secrets.json', token='credentials.json'):
        """Authenticate GSC"""
        if os.path.isfile(token):
            account = searchconsole.authenticate(client_config=config,
                                                credentials=token)
        else:
            account = searchconsole.authenticate(client_config=config,
                                            serialize=token)
        return account
    
    
    if __name__ == '__main__':
        account = authenticate(config=client_secrets)
        c = Causal(
            account,
            intervention,
            test_sites,
            control_sites=control_sites, 
            metric='clicks')
        c.run_causal()
    

    Beware When Using Causal Impact in SEO Experiments

    For those using Causal Impact for SEO experiments, the results can be precise, but they can also be very wrong. Many factors can affect the quality of your predictions:

    • Size of the test data.
    • Length of the period prior to the experiment.
    • Choice of the control group to be compared against.
    • Seasonality hyperparameters.
    • Number of iterations.
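Before trusting the model, it can help to sanity-check the inputs. Below is a rough sketch (invented data, arbitrary thresholds, not official guidance) that checks the pre-period length and how well a control series tracks the test series before the intervention:

```python
import pandas as pd

# Hypothetical dataframe with a test column y and one control column X1
data = pd.DataFrame({
    'date': pd.date_range('2021-06-01', periods=90, freq='D').astype(str),
    'y':  range(100, 190),
    'X1': range(50, 140),
})
intervention = '2021-08-01'

# Keep only rows before the intervention (ISO dates compare correctly as strings)
pre = data[data['date'] < intervention]
print(f'Pre-period length: {len(pre)} days')

# How well does the control predict the test series before the change?
corr = pre['y'].corr(pre['X1'])
print(f'Pre-period correlation y vs X1: {corr:.2f}')

# Arbitrary thresholds, for illustration only
assert len(pre) >= 30, 'pre-period may be too short'
assert corr > 0.8, 'control may be a poor predictor of the test series'
```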

    Read my article on Oncrawl on the subject:

    Evaluating The Quality Of CausalImpact Predictions

    Conclusion

    If you are not sure what the graph means, learn how to interpret CausalImpact graphs.

    Congratulations, you have now run CausalImpact with Python on your Google Search Console data using the pycausalimpact package.
