SEO Experiments-Create Test and Control Groups (with Python)

In this tutorial, you will see an example of the kind of logic that you can use to create comparable control and test group splits for SEO experiments.

The main goal is to have groups of pages that are very similar so that you can successfully run SEO experiments.

  • Find groups of pages that correlate to each other
  • Remove outlier pages (e.g. Any page that have seen very large spikes)
  • Smooth out outlier days: Any single date that have seen unusual large variation.
  • Use longer time frame when creating your splits (3 years is better than 1)

Importance of Having Similar Test and Control Groups

Having test and control groups that are not similar enough can lead to lower quality of predictions, with examples showing up to 20% variation, when there was in fact no change (false positive). This can ultimately lead to the wrong decisions being made.

Join the Newsletter

    Example Code to Create SEO Testing Splits

    This code is meant as an example of how you can create SEO testing splits, but should be improved using your own data, and fine tuned through hyper parameter tuning.

    It is meant to help you understand the logic, not to build in production. Most of it was done vibe coding in Gemini, and only validated by me to make sure the logic is the right one.

    Here is what it does.

    test_splitter.py takes in a sample.csv file with 3 columns:

    • date,
    • url,
    • clicks

    Then, here is the main logic.

    1. Loads historical website traffic data from a CSV.
    2. Excludes pages with too many data outliers (“Unused”).
    3. Iteratively shuffles the remaining pages into test/control groups.
    4. Finds the most statistically similar group pairing (best correlation).
    5. Generates a detailed quality assessment report for the split.
    6. Saves the final group assignments to a new CSV.
    7. Logs all text output to a .txt file.
    8. Creates charts comparing the selected groups’ performance over time.

    See full code on Github.

    Evaluate the Results

    If you need to evaluate your results, you can then use Causal Impact or learn about techniques used by Tripadvisor. See my tutorial on Causal Impact with Python.

    Other SEO Testing Guidelines

    Here are a few principle that you should always be aware when running SEO experiments.

    Main Reasons Why SEO Experimentation May be Wrong

    1. Data source is wrong.
    2. Test and Control are not similar enough
    3. Not enough traffic
    4. Not enough sample pages
    5. Too many confounding factors (algorithm updates, major event)
    6. Not enough re-testing

    Real World Data is Messy

    The above screenshots show how a fake sample can be highly correlated, but if you run a similar thing on real workd data such as Wikipedia traffic data you will find how correlation can become much lower.

    Output

    This will return a visualization of your splits. An assignments.csv file with the details of your URLs and splits and an analysis_log.txt that returns the detail from the code execution.

    Visualizations

    Assignments

                               URL    Group
    0    https://example.com/page0     Test
    1    https://example.com/page1     Test
    2   https://example.com/page10  Control
    3   https://example.com/page11  Control
    4   https://example.com/page12  Control
    ...
    
    

    Analysis_log

    Step 1: Loading data from 'sample.csv'...
    ✅ Data loaded and prepared.
    
    Step 2: Filtering data to the last 2 years...
    ✅ Data filtered. Using records from 2023-09-23 to today.
    
    Step 3: Identifying pages with >20% outliers to exclude...
    ✅ Found 0 pages to mark as 'Unused'.
    ✅ Proceeding with 100 usable pages.
    
    Step 4: Performing 10 shuffles to find the best Test/Control split...
      Shuffle 1/10: Similarity (Pearson correlation) = 0.9035
      Shuffle 2/10: Similarity (Pearson correlation) = 0.8901
      Shuffle 3/10: Similarity (Pearson correlation) = 0.9032
      Shuffle 4/10: Similarity (Pearson correlation) = 0.8909
      Shuffle 5/10: Similarity (Pearson correlation) = 0.8936
      Shuffle 6/10: Similarity (Pearson correlation) = 0.9006
      Shuffle 7/10: Similarity (Pearson correlation) = 0.9146
      Shuffle 8/10: Similarity (Pearson correlation) = 0.9010
      Shuffle 9/10: Similarity (Pearson correlation) = 0.8967
      Shuffle 10/10: Similarity (Pearson correlation) = 0.9066
    ✅ Found best split with a correlation of 0.9146.
    
    Step 5: Generating final outputs for the top-ranked split...
    --- OUTPUT 1: Split Quality Assessment ---
    📊 Correlation Score: 0.9146
       Assessment: Good. The groups are well-correlated.
    
    📈 Minimum Detectable Effect (MDE):
       With 80% power and 95% confidence, you would need to see a lift of at least 2.54% to declare a statistically significant result.
       (This corresponds to an average daily click change of ~442 clicks.)
    
    ✅ Final Verdict: The experimental setup is viable.
    
    ...
    
    

    Full Code

    import pandas as pd
    import numpy as np
    from scipy.stats import pearsonr, t
    from datetime import datetime, timedelta
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sys
    
    def load_and_prepare_data(filename="sample.csv"):
        """
        Loads data from the specified CSV file and prepares it for analysis.
        """
        print(f"Step 1: Loading data from '{filename}'...")
        try:
            df = pd.read_csv(filename)
            df['date'] = pd.to_datetime(df['date'])
            print("✅ Data loaded and prepared.\n")
            return df
        except FileNotFoundError:
            print(f"Error: The file '{filename}' was not found.")
            print("Please ensure the CSV file is in the same directory as the script.")
            return None
    
    def find_best_split(df_filtered, num_shuffles=10, outlier_threshold=0.20):
        """
        UPDATED: Performs shuffles, stores all results, and returns them sorted by score.
        """
        print(f"Step 3: Identifying pages with >{outlier_threshold:.0%} outliers to exclude...")
        
        page_stats = {}
        for url, group in df_filtered.groupby('url'):
            q1 = group['clicks'].quantile(0.25)
            q3 = group['clicks'].quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            
            outliers = group[(group['clicks'] < lower_bound) | (group['clicks'] > upper_bound)]
            outlier_ratio = len(outliers) / len(group)
            page_stats[url] = outlier_ratio
    
        unused_urls = {url for url, ratio in page_stats.items() if ratio > outlier_threshold}
        usable_urls = list(set(df_filtered['url']) - unused_urls)
        
        print(f"✅ Found {len(unused_urls)} pages to mark as 'Unused'.")
        print(f"✅ Proceeding with {len(usable_urls)} usable pages.\n")
    
        print(f"Step 4: Performing {num_shuffles} shuffles to find the best Test/Control split...")
        
        all_splits = [] # Store results of all shuffles
    
        for i in range(num_shuffles):
            np.random.shuffle(usable_urls)
            midpoint = len(usable_urls) // 2
            test_urls = usable_urls[:midpoint]
            control_urls = usable_urls[midpoint:]
    
            test_df = df_filtered[df_filtered['url'].isin(test_urls)]
            control_df = df_filtered[df_filtered['url'].isin(control_urls)]
            
            test_ts = test_df.groupby('date')['clicks'].sum()
            control_ts = control_df.groupby('date')['clicks'].sum()
            
            def impute_outliers_with_mean(series):
                q1, q3 = series.quantile(0.25), series.quantile(0.75)
                iqr = q3 - q1
                series_mean = series.mean()
                lower_bound = q1 - 1.5 * iqr
                upper_bound = q3 + 1.5 * iqr
                return series.where((series >= lower_bound) & (series <= upper_bound), series_mean)
    
            test_ts_smoothed = impute_outliers_with_mean(test_ts)
            control_ts_smoothed = impute_outliers_with_mean(control_ts)
    
            common_index = test_ts_smoothed.index.intersection(control_ts_smoothed.index)
            if len(common_index) > 1:
                corr, _ = pearsonr(test_ts_smoothed[common_index], control_ts_smoothed[common_index])
                print(f"  Shuffle {i+1}/{num_shuffles}: Similarity (Pearson correlation) = {corr:.4f}")
                
                # Append the result of this shuffle
                all_splits.append({
                    'score': corr,
                    'test_urls': test_urls,
                    'control_urls': control_urls
                })
            else:
                print(f"  Shuffle {i+1}/{num_shuffles}: Not enough data to compare.")
        
        # Sort all splits by score in descending order
        all_splits.sort(key=lambda x: x['score'], reverse=True)
        
        if all_splits:
            print(f"✅ Found best split with a correlation of {all_splits[0]['score']:.4f}.\n")
        
        return all_splits, unused_urls
    
    def generate_assessment_and_table(best_split, unused_urls, all_urls_df):
        """
        Generates the final assessment report and the assignment table.
        """
        print("Step 5: Generating final outputs for the top-ranked split...")
        
        correlation = best_split['score']
        
        if correlation > 0.95:
            correlation_assessment = "Excellent. The groups are highly correlated and suitable for testing."
        elif correlation > 0.90:
            correlation_assessment = "Good. The groups are well-correlated."
        else:
            correlation_assessment = "Fair. The groups have a moderate correlation. Be cautious with results."
    
        alpha, power = 0.05, 0.80
        test_ts = all_urls_df[all_urls_df['url'].isin(best_split['test_urls'])].groupby('date')['clicks'].sum()
        control_ts = all_urls_df[all_urls_df['url'].isin(best_split['control_urls'])].groupby('date')['clicks'].sum()
        
        pooled_std = np.sqrt((np.var(test_ts) + np.var(control_ts)) / 2)
        n = len(test_ts)
        t_alpha = t.ppf(1 - alpha / 2, df=2 * n - 2)
        t_beta = t.ppf(power, df=2 * n - 2)
        
        mde_absolute = (t_alpha + t_beta) * pooled_std * np.sqrt(2 / n)
        mde_relative = mde_absolute / control_ts.mean()
        
        verdict = "The experimental setup is viable." if correlation > 0.90 else "The experimental setup is risky. The low correlation may obscure results."
    
        print("--- OUTPUT 1: Split Quality Assessment ---")
        print(f"📊 Correlation Score: {correlation:.4f}")
        print(f"   Assessment: {correlation_assessment}\n")
        print(f"📈 Minimum Detectable Effect (MDE):")
        print(f"   With 80% power and 95% confidence, you would need to see a lift of at least {mde_relative:.2%} to declare a statistically significant result.")
        print(f"   (This corresponds to an average daily click change of ~{mde_absolute:.0f} clicks.)\n")
        print(f"✅ Final Verdict: {verdict}")
        print("------------------------------------------\n")
        
        all_urls = list(all_urls_df['url'].unique())
        assignments = []
        for url in all_urls:
            if url in best_split['test_urls']: group = "Test"
            elif url in best_split['control_urls']: group = "Control"
            else: group = "Unused"
            assignments.append({'URL': url, 'Group': group})
            
        assignment_df = pd.DataFrame(assignments)
        
        print("--- OUTPUT 2: Assignment Table ---")
        print(assignment_df.to_string())
        print("------------------------------------\n")
    
        return assignment_df
    
    def plot_top_n_shuffles(all_splits, df_filtered, n=3):
        """
        NEW: Creates side-by-side plots for the top n shuffles.
        """
        print(f"Step 6: Generating side-by-side comparison for top {n} splits...")
        
        n = min(n, len(all_splits)) 
        if n == 0:
            print("No splits to plot.")
            return
    
        fig, axes = plt.subplots(n, 1, figsize=(12, 5 * n), sharex=True)
        if n == 1: 
            axes = [axes]
    
        for i in range(n):
            split_data = all_splits[i]
            ax = axes[i]
            
            test_ts = df_filtered[df_filtered['url'].isin(split_data['test_urls'])].groupby('date')['clicks'].sum().rolling(window=7).mean()
            control_ts = df_filtered[df_filtered['url'].isin(split_data['control_urls'])].groupby('date')['clicks'].sum().rolling(window=7).mean()
            
            ax.plot(test_ts.index, test_ts.values, label='Test Group', color='#007acc', linewidth=2)
            ax.plot(control_ts.index, control_ts.values, label='Control Group', color='#ff6347', linewidth=2)
            
            ax.set_title(f"Rank {i+1}: Correlation = {split_data['score']:.4f}", fontsize=14)
            ax.set_ylabel("7-Day Rolling Avg Clicks")
            ax.legend()
            ax.grid(True)
    
        plt.xlabel("Date", fontsize=12)
        fig.suptitle("Top N Shuffle Comparisons", fontsize=18, y=1.02)
        plt.tight_layout()
        plt.show()
    
    def plot_final_split(best_split, unused_urls, df_filtered):
        """
        Generates a detailed plot for the top-ranked split, including the unused group.
        """
        print("Step 7: Generating detailed plot for the top-ranked split...")
    
        test_urls, control_urls = best_split['test_urls'], best_split['control_urls']
        test_ts_rolling = df_filtered[df_filtered['url'].isin(test_urls)].groupby('date')['clicks'].sum().rolling(window=7).mean()
        control_ts_rolling = df_filtered[df_filtered['url'].isin(control_urls)].groupby('date')['clicks'].sum().rolling(window=7).mean()
    
        plt.style.use('seaborn-v0_8-whitegrid')
        fig, ax = plt.subplots(figsize=(14, 7))
    
        ax.plot(test_ts_rolling.index, test_ts_rolling.values, label='Test Group (7-Day Rolling Avg)', color='#007acc', linewidth=2)
        ax.plot(control_ts_rolling.index, control_ts_rolling.values, label='Control Group (7-Day Rolling Avg)', color='#ff6347', linewidth=2)
        
        if unused_urls:
            unused_ts_rolling = df_filtered[df_filtered['url'].isin(unused_urls)].groupby('date')['clicks'].sum().rolling(window=7).mean()
            ax.plot(unused_ts_rolling.index, unused_ts_rolling.values, label='Unused Group (7-Day Rolling Avg)', color='grey', linestyle=':', linewidth=1.5)
    
        ax.set_title(f'Top-Ranked Split: 7-Day Rolling Average (Correlation: {best_split["score"]:.4f})', fontsize=16)
        ax.set_xlabel('Date', fontsize=12)
        ax.set_ylabel('Total Daily Clicks', fontsize=12)
        ax.legend(fontsize=10)
        ax.tick_params(axis='x', rotation=45)
        plt.tight_layout()
        plt.show()
    
    def plot_relative_performance(best_split, df_filtered):
        """
        Generates a relative performance plot for the top-ranked split.
        """
        print("Step 8: Generating relative performance plot for the top-ranked split...")
    
        test_urls, control_urls = best_split['test_urls'], best_split['control_urls']
        test_ts_rolling = df_filtered[df_filtered['url'].isin(test_urls)].groupby('date')['clicks'].sum().rolling(window=7).mean()
        control_ts_rolling = df_filtered[df_filtered['url'].isin(control_urls)].groupby('date')['clicks'].sum().rolling(window=7).mean()
    
        relative_performance = ((test_ts_rolling - control_ts_rolling) / control_ts_rolling) * 100
        relative_performance.replace([np.inf, -np.inf], np.nan, inplace=True)
        relative_performance.dropna(inplace=True)
    
        plt.style.use('seaborn-v0_8-whitegrid')
        fig, ax = plt.subplots(figsize=(14, 7))
    
        ax.plot(relative_performance.index, relative_performance.values, color='black', linewidth=1.5)
        ax.axhline(0, color='grey', linestyle='--', linewidth=1)
        ax.fill_between(relative_performance.index, relative_performance.values, 0, where=relative_performance >= 0, facecolor='green', alpha=0.3)
        ax.fill_between(relative_performance.index, relative_performance.values, 0, where=relative_performance < 0, facecolor='red', alpha=0.3)
        ax.set_title('Top-Ranked Split: Test Group Relative Performance', fontsize=16)
        ax.set_xlabel('Date', fontsize=12)
        ax.set_ylabel('Performance Difference (%)', fontsize=12)
        ax.tick_params(axis='x', rotation=45)
        plt.tight_layout()
        plt.show()
    
    def save_assignments_to_csv(assignment_df, filename='./outputs/assignments.csv'):
        """
        Saves the final assignment DataFrame to a CSV file.
        """
        try:
            assignment_df.to_csv(filename, index=False)
            print(f"✅ Assignment table saved to '{filename}'.")
        except Exception as e:
            print(f"Error saving assignment file: {e}")
    
    # --- Main Execution ---
    if __name__ == "__main__":
        log_filename = './outputs/analysis_log.txt'
        
        with open(log_filename, 'w') as log_file:
            original_stdout = sys.stdout
            sys.stdout = log_file
            
            raw_df = load_and_prepare_data(filename="sample.csv")
            
            plot_data = None
            
            if raw_df is not None:
                print("Step 2: Filtering data to the last 2 years...")
                two_years_ago = datetime.now() - timedelta(days=730)
                df_filtered = raw_df[raw_df['date'] >= two_years_ago].copy()
                print(f"✅ Data filtered. Using records from {two_years_ago.date()} to today.\n")
                
                # UPDATED: find_best_split now returns a list of all splits
                all_splits_info, unused_urls_list = find_best_split(df_filtered)
                
                if all_splits_info:
                    # The best split is the first one in the sorted list
                    best_split_info = all_splits_info[0]
                    
                    final_table = generate_assessment_and_table(best_split_info, unused_urls_list, raw_df)
                    save_assignments_to_csv(final_table)
                    
                    # Store all data needed for the plots
                    plot_data = {
                        'all_splits': all_splits_info,
                        'best_split': best_split_info, 
                        'unused_urls': unused_urls_list, 
                        'df_filtered': df_filtered
                    }
                else:
                    print("Could not find any suitable splits. Please check the input data.")
            
            sys.stdout = original_stdout
    
        print(f"✅ Analysis complete. Log saved to '{log_filename}'.")
    
        if plot_data:
            # NEW: Call the side-by-side comparison plot first
            plot_top_n_shuffles(plot_data['all_splits'], plot_data['df_filtered'], n=3)
            
            # Then, show the detailed plots for the #1 split
            plot_final_split(plot_data['best_split'], plot_data['unused_urls'], plot_data['df_filtered'])
            plot_relative_performance(plot_data['best_split'], plot_data['df_filtered'])
    
    Enjoyed This Post?