In this tutorial, you will learn how to query the Wayback Machine API with Python in order to retrieve the archived HTML for a specific domain.
Shoutout to Antoine Eripret and LeeFoot for making much better tutorials than mine.
Import Relevant Python Libraries
In this tutorial, we will use the pandas and requests libraries, as well as the time library to slow down the crawl in case of errors.
import pandas as pd
import time
import requests
Use pip to install any library you are missing.
pip3 install requests pandas
Fetch the History of a Domain
The Archive.org API has a CDX search endpoint that returns all the archived URLs for a domain.
Replace the domain in the URL below with whatever domain you want to look at. The limit parameter allows you to reduce the number of results for large domains.
http://web.archive.org/cdx/search/cdx?url=jcchouinard.com*&output=json&limit=100
In Python, it looks like this.
domain = 'jcchouinard.com'
# Get all history for a domain
# Add &limit parameter if you work with a large site
all_history_endpoint = f'http://web.archive.org/cdx/search/cdx?url={domain}*&output=json'
r = requests.get(all_history_endpoint)
urls = r.json()
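If you are working with a large site, the limit parameter mentioned in the comment above can simply be appended to the endpoint. The value of 1000 below is arbitrary; adjust it to your needs.
# Example: cap the number of rows returned for large sites.
# 1000 is an arbitrary value; adjust it to your needs.
limited_endpoint = (
    f'http://web.archive.org/cdx/search/cdx?url={domain}*'
    '&output=json&limit=1000'
)
# Use it in place of all_history_endpoint above.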
A previous boss of mine told me that I should mention the other filters available to limit the amount of extracted information. For instance, the MIME type filter allows you to extract only text/html results. It also seems like a good idea to share the link to the filtering documentation here.
(Also, it seems like he is still bossing me around somehow.)
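To illustrate what he meant, here is a sketch of the same request with filters applied server-side through the CDX filter parameter. The mimetype and statuscode field names below come from the CDX documentation as I understand it; double-check them against the current docs before relying on this.
# Ask the CDX endpoint to pre-filter results server-side:
# only HTML snapshots that returned a 200 status code.
filtered_endpoint = (
    f'http://web.archive.org/cdx/search/cdx?url={domain}*'
    '&output=json'
    '&filter=mimetype:text/html'
    '&filter=statuscode:200'
)
filtered_urls = requests.get(filtered_endpoint).json()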
Get All Valid HTML Pages
Now, the Wayback Machine may return pages that are not HTML, or pages that returned errors, and it also stores the same page at different points in time. We want to limit the results to valid pages.
We will create a pandas DataFrame that contains only the latest valid version of each stored HTML page. You can also adapt the list comprehension below to only keep URLs that follow a specific URL pattern; an example of that adaptation follows the code.
# Keep only HTML with status code 200
# (this also removes the CDX header row)
urls = [
    (u[2].replace(':80','').replace('http:', 'https:'), u[1])
    for u in urls
    if u[3] == 'text/html' and u[4] == '200'
]
# Create a dataframe with the latest snapshot's timestamp
# Drop all other "duplicate" URLs
df = pd.DataFrame(urls, columns=['url','timestamp'])\
    .sort_values(by='timestamp', ascending=False)\
    .drop_duplicates(subset='url')\
    .reset_index(drop=True)
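As mentioned above, the list comprehension can be adapted to keep only URLs that match a specific pattern. The '/python/' path below is purely illustrative; swap in whatever pattern fits the section of the site you care about.
# Example adaptation: keep only HTML pages under a specific path.
# '/python/' is a hypothetical pattern used for illustration only.
urls = [
    (u[2].replace(':80','').replace('http:', 'https:'), u[1])
    for u in urls
    if u[3] == 'text/html' and u[4] == '200' and '/python/' in u[2]
]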
Fetch Every Relevant URL with the Archive.org API
Now, you need to fetch each page with requests to get the actual HTML snapshot for each URL.
Here is the Python code with some simple safeguarding: we loop through each row, fetch the page, get the HTML, and store it in a dictionary keyed by URL.
results = {}

# Loop through each URL, fetch the archived page and store the HTML,
# or an error marker if the request fails.
for i in range(len(df)):
    url, timestamp = df.loc[i]
    print('fetching:', url)
    wayback_url = f'http://web.archive.org/web/{timestamp}/{url}'
    try:
        response = requests.get(wayback_url)
        if response.status_code == 200:
            results[url] = {
                'wayback_url': wayback_url,
                'html': response.text,
                'status_code': response.status_code
            }
        else:
            results[url] = {
                'wayback_url': wayback_url,
                'html': 'status_code_error',
                'status_code': response.status_code
            }
    except requests.exceptions.ConnectionError:
        results[url] = {
            'wayback_url': wayback_url,
            'html': 'request_error',
            'exception': 'Connection refused',
        }
        print('connection error, sleeping for 10s...')
        time.sleep(10)
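One quick, optional way to check what came back is to load the results dictionary into a DataFrame and count the status codes. This inspection step is not part of the script above, just a suggestion.
# Inspect the fetched pages: one row per URL.
# Note: entries that failed with a connection error have no status_code,
# so they show up as NaN here.
results_df = pd.DataFrame.from_dict(results, orient='index')
print(results_df['status_code'].value_counts(dropna=False))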
Full Code
import pandas as pd
import time
import requests
domain = 'jcchouinard.com'
# Get all history for a domain
# Add &limit parameter if you work with a large site
all_history_endpoint = f'http://web.archive.org/cdx/search/cdx?url={domain}*&output=json'
r = requests.get(all_history_endpoint)
urls = r.json()
# Keep only HTML with status code 200
# (this also removes the CDX header row)
urls = [
    (u[2].replace(':80','').replace('http:', 'https:'), u[1])
    for u in urls
    if u[3] == 'text/html' and u[4] == '200'
]
# Create a dataframe with the latest snapshot's timestamp
# Drop all other "duplicate" URLs
df = pd.DataFrame(urls, columns=['url','timestamp'])\
    .sort_values(by='timestamp', ascending=False)\
    .drop_duplicates(subset='url')\
    .reset_index(drop=True)
results = {}

# Loop through each URL, fetch the archived page and store the HTML,
# or an error marker if the request fails.
for i in range(len(df)):
    url, timestamp = df.loc[i]
    print('fetching:', url)
    wayback_url = f'http://web.archive.org/web/{timestamp}/{url}'
    try:
        response = requests.get(wayback_url)
        if response.status_code == 200:
            results[url] = {
                'wayback_url': wayback_url,
                'html': response.text,
                'status_code': response.status_code
            }
        else:
            results[url] = {
                'wayback_url': wayback_url,
                'html': 'status_code_error',
                'status_code': response.status_code
            }
    except requests.exceptions.ConnectionError:
        results[url] = {
            'wayback_url': wayback_url,
            'html': 'request_error',
            'exception': 'Connection refused',
        }
        print('connection error, sleeping for 10s...')
        time.sleep(10)