Wayback Machine (Archive.org) API with Python

In this tutorial, you will learn how to query the Wayback Machine (Archive.org) API with Python in order to retrieve the archived HTML for a specific domain.

Shoutout to Antoine Eripret and LeeFoot for making much better tutorials than mine 🙂

Import Relevant Python Libraries

In this tutorial, we will use the pandas and requests libraries, along with the time library to slow down the crawl when errors occur.


import pandas as pd
import time
import requests

Use pip to install any missing libraries.

pip3 install requests pandas

Fetch the History of a Domain

The archive.org API has a CDX search endpoint that returns all the archived URLs for a domain.

You can replace jcchouinard.com with whatever domain you want to look at. The limit parameter lets you cap the number of results for large domains.

http://web.archive.org/cdx/search/cdx?url=jcchouinard.com*&output=json&limit=100

In Python, it looks like this.

domain = 'jcchouinard.com'

# Get all history for a domain
# Add &limit parameter if you work with a large site
all_history_endpoint = f'http://web.archive.org/cdx/search/cdx?url={domain}*&output=json'
r = requests.get(all_history_endpoint)
urls = r.json()
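
For reference, the CDX response is a list of rows: the first row contains the field names, and each following row describes one capture. The capture row shown in the comments below is illustrative only, not a real snapshot.

# Inspect the shape of the response: the first row holds the field names
print(urls[0])
# ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length']
# Each following row describes one capture, for example (values made up for illustration):
# ['com,jcchouinard)/', '20200101000000', 'http://jcchouinard.com:80/', 'text/html', '200', 'SOMEDIGEST', '12345']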

A previous boss of mine told me that I should mention the other filters available to limit the amount of extracted data. For instance, the mimetype filter lets you request only text/html captures. The CDX server's filtering documentation covers the full list of options.

(Also, it seems like he is still bossing me around somehow 😂😂)
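
For instance, here is a hedged sketch using the CDX server's filter parameter (each filter takes the form field:regex and can be repeated). It builds on the domain variable and requests import above. Filtering server-side reduces how much data you download; the rest of this tutorial filters client-side instead.

# Ask the API to return only text/html captures that responded with a 200
filtered_endpoint = (
    f'http://web.archive.org/cdx/search/cdx?url={domain}*'
    '&output=json'
    '&filter=mimetype:text/html'
    '&filter=statuscode:200'
)
r = requests.get(filtered_endpoint)
urls = r.json()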

Get All Valid HTML Pages

Now, the Wayback Machine may return pages that are not HTML, or pages that returned errors. It also stores the same page at multiple points in time. We want to limit the results to the valid pages.

We will create a pandas DataFrame that contains only the latest valid version of each stored HTML page. You can also adapt the list comprehension below so that it only keeps URLs that follow a specific URL pattern (see the example after the code).

# Keep only HTML captures with status code 200
# u[1] = timestamp, u[2] = original URL, u[3] = mimetype, u[4] = status code
urls = [
    (u[2].replace(':80', '').replace('http:', 'https:'), u[1])
    for u in urls
    if u[3] == 'text/html' and u[4] == '200'
]

# Create a dataframe with the last snapshot's timestamp for each URL
# Drop all other "duplicate" URLs
# (the header row was already removed by the mimetype filter above)
df = pd.DataFrame(urls, columns=['url', 'timestamp'])\
    .sort_values(by='timestamp', ascending=False)\
    .drop_duplicates(subset='url')\
    .reset_index(drop=True)
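
For example, if you only wanted pages under a specific path (the '/category/' pattern below is just an illustration), you could replace the comprehension above with a variant like this:

# Keep only HTML captures whose URL contains an (illustrative) path pattern
pattern = '/category/'
urls = [
    (u[2].replace(':80', '').replace('http:', 'https:'), u[1])
    for u in urls
    if u[3] == 'text/html' and u[4] == '200' and pattern in u[2]
]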

Fetch Every Relevant URL with the Archive.org API

Now, you need to fetch each page with requests to get the archived HTML snapshot for every URL.

Here is the Python code with some simple safeguarding: we loop through each row of the DataFrame, fetch the archived page, and store the HTML in a dictionary keyed by URL.

results = {}

for i in range(len(df)):
    url, timestamp = df.loc[i]
    print('fetching:', url)
    wayback_url = f'http://web.archive.org/web/{timestamp}/{url}'
    try:
        response = requests.get(wayback_url)
        if response.status_code == 200:
            results[url] = {
                'wayback_url': wayback_url,
                'html': response.text,
                'status_code': response.status_code
            }
        else:
            results[url] = {
                'wayback_url': wayback_url,
                'html': 'status_code_error',
                'status_code': response.status_code
            }
    except requests.exceptions.ConnectionError:
        results[url] = {
            'wayback_url': wayback_url,
            'html': 'request_error',
            'exception': 'Connection refused',
        }
        print('connection error, sleeping for 10s...')
        time.sleep(10)
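
As an optional last step (not part of the original tutorial), you can load the results dictionary into a DataFrame to inspect or export what was fetched:

# One row per URL, with the wayback URL, the HTML (or error marker) and the status code
results_df = pd.DataFrame.from_dict(results, orient='index')
results_df.index.name = 'url'
print(results_df.head())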

Full Code

import pandas as pd
import time 
import requests 

domain = 'jcchouinard.com'

# Get all history for a domain
# Add &limit parameter if you work with a large site
all_history_endpoint = f'http://web.archive.org/cdx/search/cdx?url={domain}*&output=json'
r = requests.get(all_history_endpoint)
urls = r.json()

# Keep only HTML captures with status code 200
# u[1] = timestamp, u[2] = original URL, u[3] = mimetype, u[4] = status code
urls = [
    (u[2].replace(':80', '').replace('http:', 'https:'), u[1])
    for u in urls
    if u[3] == 'text/html' and u[4] == '200'
]

# Create a dataframe with the last snapshot's timestamp for each URL
# Drop all other "duplicate" URLs
# (the header row was already removed by the mimetype filter above)
df = pd.DataFrame(urls, columns=['url', 'timestamp'])\
    .sort_values(by='timestamp', ascending=False)\
    .drop_duplicates(subset='url')\
    .reset_index(drop=True)

results = {}

for i in range(len(df)):
    url, timestamp = df.loc[i]
    print('fetching:', url)
    wayback_url = f'http://web.archive.org/web/{timestamp}/{url}'
    try:
        response = requests.get(wayback_url)
        if response.status_code == 200:
            results[url] = {
                'wayback_url': wayback_url,
                'html': response.text,
                'status_code': response.status_code
            }
        else:
            results[url] = {
                'wayback_url': wayback_url,
                'html': 'status_code_error',
                'status_code': response.status_code
            }
    except requests.exceptions.ConnectionError:
        results[url] = {
            'wayback_url': wayback_url,
            'html': 'request_error',
            'exception': 'Connection refused',
        }
        print('connection error, sleeping for 10s...')
        time.sleep(10)
