Wayback Machine (Archive.org) API with Python

In this tutorial, you will learn how to use the Wayback Machine (Archive.org) API with Python to retrieve archived HTML for a specific domain.

Shoutout to Antoine Eripret and LeeFoot for making much better tutorials than mine 🙂

Import Relevant Python Libraries

In this tutorial, we will use the pandas and requests libraries, along with the time library to slow down the requests when errors occur.

    import pandas as pd
    import time
    import requests
    

Use pip to install any missing libraries.

    pip3 install requests pandas

Fetch the History of a Domain

The Archive.org API has a CDX search endpoint that returns all the archived URLs for a domain.

Replace the domain below with whatever domain you want to look at. The limit parameter lets you reduce the number of results for large domains.

    http://web.archive.org/cdx/search/cdx?url=jcchouinard.com*&output=json&limit=100

In Python, it looks like this.

    domain = 'jcchouinard.com'
    
    # Get all history for a domain
    # Add &limit parameter if you work with a large site
    all_history_endpoint = f'http://web.archive.org/cdx/search/cdx?url={domain}*&output=json'
    r = requests.get(all_history_endpoint)
    urls = r.json()
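
The first row of the JSON response is a header row describing the fields; the rows that follow are the individual snapshots. As a quick sketch, you can print the first two rows to see why the next step picks values by index.

    # Sketch: inspect the CDX response
    # The first row is a header row; the rows after it are snapshots
    print(urls[0])  # typically ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length']
    print(urls[1])  # a snapshot row: index 1 = timestamp, 2 = original URL, 3 = mimetype, 4 = status code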
    

A previous boss of mine told me that I should mention the other filters available to limit the amount of extracted information. For instance, the mimetype filter lets you extract only text/html results. The available filtering options are described in the Wayback CDX server documentation.

(Also, it seems like he is still bossing me around somehow 😂😂)
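
As a sketch of what that could look like, the request below passes the documented filter and limit query parameters to the CDX endpoint, so the filtering happens on the API side instead of in Python.

    # Sketch: let the CDX API do the filtering
    # (uses the documented filter= and limit= query parameters)
    filtered_endpoint = (
        f'http://web.archive.org/cdx/search/cdx?url={domain}*'
        '&output=json'
        '&filter=mimetype:text/html'  # only HTML snapshots
        '&filter=statuscode:200'      # only successful captures
        '&limit=1000'                 # cap the number of rows returned
    )
    filtered_urls = requests.get(filtered_endpoint).json()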

Get All Valid HTML Pages

The Wayback Machine may return snapshots that are not HTML, or snapshots of pages that returned errors. It also stores the same page at different points in time. We want to limit the results to the latest valid HTML pages.

We will create a pandas DataFrame that contains only the latest valid version of the stored HTML. You can also adapt the list comprehension to only keep URLs that follow a specific URL pattern, as shown in the sketch after the code block below.

    # The first row of the CDX response is a header row, so skip it
    # Keep only HTML with status code 200
    urls = [
        (u[2].replace(':80', '').replace('http:', 'https:'), u[1])
        for u in urls[1:]
        if u[3] == 'text/html' and u[4] == '200'
    ]

    # Create a dataframe with the last snapshot's timestamp
    # Drop all other "duplicate" URLs
    df = pd.DataFrame(urls, columns=['url', 'timestamp'])\
        .sort_values(by='timestamp', ascending=False)\
        .drop_duplicates(subset='url')\
        .reset_index(drop=True)
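
If you only want pages under a specific path, here is one way the list comprehension could be adapted. Run it instead of the comprehension above, on the raw CDX response; the '/blog/' pattern is just a hypothetical example.

    # Sketch: the same list comprehension with an extra URL pattern check
    # ('/blog/' is a hypothetical example; replace it with your own pattern)
    pattern = '/blog/'
    urls = [
        (u[2].replace(':80', '').replace('http:', 'https:'), u[1])
        for u in urls[1:]
        if u[3] == 'text/html' and u[4] == '200' and pattern in u[2]
    ]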
    

Fetch Every Relevant URL with the Archive.org API

Now, you need to fetch each archived page with requests to get the actual HTML snapshot for each URL.

Here is the Python code with some simple safeguarding. We loop through each row of the DataFrame, fetch the archived page, and store the HTML in a simple dictionary.

    results = {}
    
    for i in range(len(df)):
        url, timestamp = df.loc[i]
        print('fetching:', url)
        wayback_url = f'http://web.archive.org/web/{timestamp}/{url}'
        try:
            html = requests.get(wayback_url)
            if html.status_code == 200:
                results[url] = {
                    'wayback_url':wayback_url,
                    'html': html.text,
                    'status_code': html.status_code
                }
            else:
                results[url] = {
                    'wayback_url':wayback_url,
                    'html': 'status_code_error',
                    'status_code': html.status_code
                }
        except requests.exceptions.ConnectionError:
            results[url] = {
                'wayback_url':wayback_url,
                'html': 'request_error',
                'exception': 'Connection refused',
            }
            print('too many tries, sleeping for 10s...')
            time.sleep(10)
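
Once the loop has finished, results is a dictionary keyed by URL. As a sketch of one way to use it, you could load it into a DataFrame and write each snapshot's HTML to disk; the snapshots folder and the file-naming scheme below are assumptions, not part of the API.

    # Sketch: save the retrieved HTML to disk
    # (the 'snapshots' folder and file names are arbitrary choices)
    import os

    os.makedirs('snapshots', exist_ok=True)

    results_df = pd.DataFrame.from_dict(results, orient='index')
    print(results_df.head())

    for url, data in results.items():
        if data['html'] not in ('status_code_error', 'request_error'):
            filename = url.replace('https://', '').replace('/', '_') + '.html'
            with open(os.path.join('snapshots', filename), 'w', encoding='utf-8') as f:
                f.write(data['html'])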
    

Full Code

    import pandas as pd
    import time 
    import requests 
    
    domain = 'jcchouinard.com'
    
    # Get all history for a domain
    # Add &limit parameter if you work with a large site
    all_history_endpoint = f'http://web.archive.org/cdx/search/cdx?url={domain}*&output=json'
    r = requests.get(all_history_endpoint)
    urls = r.json()
    
    # The first row of the CDX response is a header row, so skip it
    # Keep only HTML with status code 200
    urls = [
        (u[2].replace(':80', '').replace('http:', 'https:'), u[1])
        for u in urls[1:]
        if u[3] == 'text/html' and u[4] == '200'
    ]

    # Create a dataframe with the last snapshot's timestamp
    # Drop all other "duplicate" URLs
    df = pd.DataFrame(urls, columns=['url', 'timestamp'])\
        .sort_values(by='timestamp', ascending=False)\
        .drop_duplicates(subset='url')\
        .reset_index(drop=True)
    
    results = {}
    
    for i in range(len(df)):
        url, timestamp = df.loc[i]
        print('fetching:', url)
        wayback_url = f'http://web.archive.org/web/{timestamp}/{url}'
        try:
            html = requests.get(wayback_url)
            if html.status_code == 200:
                results[url] = {
                    'wayback_url':wayback_url,
                    'html': html.text,
                    'status_code': html.status_code
                }
            else:
                results[url] = {
                    'wayback_url':wayback_url,
                    'html': 'status_code_error',
                    'status_code': html.status_code
                }
        except requests.exceptions.ConnectionError:
            results[url] = {
                'wayback_url':wayback_url,
                'html': 'request_error',
                'exception': 'Connection refused',
            }
            print('too many tries, sleeping for 10s...')
            time.sleep(10)
    
    