In this post, we will implement and smoke test to check whether the robots.txt file changed. When the robots.txt file is changed, it will back up the previous version and send an alert to the Slack API.
To set-up a robots.txt smoke test in Python:
- Create hygiene functions to manipulate data
- Read and use the credentials
- Fetch the robots.txt
- Compare to the previous version of the Robots.txt
- Send a message to Slack
Requirements
For this tutorial you will need:
- Have Python Installed (install python)
- Create a Slack Account (create account)
- Get your Slack API Webhooks
- Add the slack webhooks to a credentials.json file
Hygiene Functions
A lot of my projects start with these functions that I use over and over.
create_project()
: If a project does not exist, create it.get_domain_name()
: Converts the URL into a folder name.fetch_page()
: Uses the request library to fetch a page.get_date()
: Get today’s date.
from datetime import datetime
import os
import requests
from urllib.parse import urlparse
def create_project(directory):
'''
Create a project if it does not exist
'''
if not os.path.exists(directory):
print('Create project: '+ directory)
os.makedirs(directory)
else:
print(f'{directory} project exists')
def get_domain_name(start_url):
'''
Get Domain Name in the www_domain_com format
1. Parse URL
2. Get Domain
3. Replace dots to make a usable folder path
'''
url = urlparse(start_url) # Parse URL into components
domain_name = url.netloc # Get Domain (or network location)
domain_name = domain_name.replace('.','_')# Replace . by _ to create usable folder
domain_name = domain_name.split(':')[0]
return domain_name
def fetch_page(url):
'''
Use requests to fetch URL
'''
try:
r = requests.get(url)
except Exception as e:
print(e)
return r
def get_date():
'''
Get today's date as YYYY-MM-DD
'''
return datetime.today().strftime('%Y-%m-%d:%H:%M%S')
Slack Functions
The get_credentials()
function uses the json
library to read the credentials file.
The post_to_slack()
function adds the message json to be added to the requests header and the credentials.
import json
def get_credentials(credentials):
'''
Read credentials from JSON file.
'''
with open(credentials, 'r') as f:
creds = json.load(f)
return creds['slack_webhook']
def post_to_slack(message,credentials):
'''
Create header with the message and post to slack.
'''
data = {'text':message}
url = get_credentials(credentials)
requests.post(url,json=data, verify=False)
Robots.txt Functions
Next, we will set-up the functions to deal with the robots.txt.
get_robots_files()
: List all the previous robots.txt files, sorted by dateget_robotstxt_name()
: Get the latest saved robots.txt.read_robotstxt()
: Reads the robots.txt and returns the file.write_robotstxt()
: Backups the new robots.txt file.
def get_robots_files(directory):
'''
List robots.txt file in output folder
'''
output_files = os.listdir(directory)
output_files.sort()
return output_files
def get_robotstxt_name(output_files):
'''
Get last saved robots.txt file.
'''
robots_txt = 'robots.txt'
files = []
for filename in output_files:
if robots_txt in filename:
files.append(filename)
return files[-1]
def read_robotstxt(filename):
'''
Read robots.txt
'''
if os.path.isfile(filename):
with open(filename,'r') as f:
txt = f.read()
return txt
def write_robotstxt(file, output, filename='robots.txt'):
'''
Write robots.txt to file using date as identifier.
'''
filename = get_date() + '-' + filename
filename = output + filename
print(f'filename: {filename}')
with open(filename,'w') as f:
f.write(file)
Compare Current VS New Robots.txt
The compare_robots()
function checks the latest saved robots.txt file, fetches the robots.txt in your site, and compares whether or not the file is the same.
If it is new, or has changed, it will write the robots.txt that is currently on your site to a file that contains the date.
Then, it will use the post_to_slack()
function to alert you of the change to the robots.txt.
If the file has not changed, the file is not saved, and a simple print function tells you that there were no new change.
def compare_robots(credentials,output,output_files,r):
'''
Compare previous robots to actual robots.
If different. Save File.
'''
new_robotstxt = r.text.replace('\r', '')
if output_files:
robots_filename = get_robotstxt_name(output_files)
filename = output + robots_filename
last_robotstxt = read_robotstxt(filename)
last_robotstxt = last_robotstxt.replace('\r', '')
if new_robotstxt != last_robotstxt:
print('Robots.txt was modified')
write_robotstxt(new_robotstxt, output)
message = f'The Robots.txt was changed for {filename}'
post_to_slack(message,credentials)
else:
print('No Change to Robots.txt')
else:
print('No Existing Robots.txt. Saving one.')
write_robotstxt(r.text, output)
Run the Code
Let’s combine the entire code into a main()
function.
site = 'https://www.jcchouinard.com/'
def main(credentials,site,filename='robots.txt'):
'''
Combine all functions
'''
url = site + filename
domain = get_domain_name(site)
output = path + '/output/' + domain + '/'
create_project(output)
r = fetch_page(url)
output_files = get_robots_files(output)
compare_robots(credentials,output,output_files,r)
To run the function, add the credentials first and then run the function we created above.
if __name__ == '__main__':
path = os.getcwd()
credentials = path + '/creds.json'
main(credentials,site)
The if name equals main line checks whether you are running the module or importing it. If you are importing it, main()
will not run.
Full Code
#!/usr/bin/env python
from datetime import datetime
import json
import os
import requests
from urllib.parse import urlparse
site = 'https://www.jcchouinard.com/'
def main(credentials,site,filename='robots.txt'):
'''
Combine all functions
'''
url = site + filename
domain = get_domain_name(site)
output = path + '/output/' + domain + '/'
create_project(output)
r = fetch_page(url)
output_files = get_robots_files(output)
compare_robots(credentials,output,output_files,r)
def get_credentials(credentials):
'''
Read credentials from JSON file.
'''
with open(credentials, 'r') as f:
creds = json.load(f)
return creds['slack_webhook']
def post_to_slack(message,credentials):
'''
Create header with the message and post to slack.
'''
data = {'text':message}
url = get_credentials(credentials)
requests.post(url,json=data, verify=False)
def create_project(directory):
'''
Create a project if it does not exist
'''
if not os.path.exists(directory):
print('Create project: '+ directory)
os.makedirs(directory)
else:
print(f'{directory} project exists')
def get_domain_name(start_url):
'''
Get Domain Name in the www_domain_com format
1. Parse URL
2. Get Domain
3. Replace dots to make a usable folder path
'''
url = urlparse(start_url) # Parse URL into components
domain_name = url.netloc # Get Domain (or network location)
domain_name = domain_name.replace('.','_')# Replace . by _ to create usable folder
domain_name = domain_name.split(':')[0]
return domain_name
def fetch_page(url):
'''
Use requests to fetch URL
'''
try:
r = requests.get(url)
except Exception as e:
print(e)
return r
def get_robots_files(directory):
'''
List robots.txt file in output folder
'''
output_files = os.listdir(directory)
output_files.sort()
return output_files
def get_robotstxt_name(output_files):
'''
Get last saved robots.txt file.
'''
robots_txt = 'robots.txt'
files = []
for filename in output_files:
if robots_txt in filename:
files.append(filename)
return files[-1]
def read_robotstxt(filename):
'''
Read robots.txt
'''
if os.path.isfile(filename):
with open(filename,'r') as f:
txt = f.read()
return txt
def get_date():
'''
Get today's date as YYYY-MM-DD
'''
return datetime.today().strftime('%Y-%m-%d:%H:%M%S')
def write_robotstxt(file, output, filename='robots.txt'):
'''
Write robots.txt to file using date as identifier.
'''
filename = get_date() + '-' + filename
filename = output + filename
print(f'filename: {filename}')
with open(filename,'w') as f:
f.write(file)
def compare_robots(credentials,output,output_files,r):
'''
Compare previous robots to actual robots.
If different. Save File.
'''
new_robotstxt = r.text.replace('\r', '')
if output_files:
robots_filename = get_robotstxt_name(output_files)
filename = output + robots_filename
last_robotstxt = read_robotstxt(filename)
last_robotstxt = last_robotstxt.replace('\r', '')
if new_robotstxt != last_robotstxt:
print('Robots.txt was modified')
write_robotstxt(new_robotstxt, output)
message = f'The Robots.txt was changed for {filename}'
post_to_slack(message,credentials)
else:
print('No Change to Robots.txt')
else:
print('No Existing Robots.txt. Saving one.')
write_robotstxt(r.text, output)
if __name__ == '__main__':
path = os.getcwd()
credentials = path + '/credentials.json'
main(credentials,site)
Automate The Robots.txt Backups
To automate the robots.txt smoke tests and backups, you can learn how to schedule your python script on Windows task scheduler, or automate python script using CRON on Mac.
You can run your CRON task every day like this:
$ 30 8 * * * /path/to/python /path/to/file/check_robotstxt.py > /path/to/logger/robotstxtchecker.log 2>&1
Concolusion
This is it, you now have implemented a smoke test to check if the robots.txt file of your site has changed. Then, managed to alert yourself using Python and the Slack API.
SEO Strategist at Tripadvisor, ex- Seek (Melbourne, Australia). Specialized in technical SEO. Writer in Python, Information Retrieval, SEO and machine learning. Guest author at SearchEngineJournal, SearchEngineLand and OnCrawl.