Alert Robots.txt Changes to Slack using Python

Share this post

This post is part of the complete guide on Slack API with Python

In this post, we will implement and smoke test to check whether the robots.txt file changed. When the robots.txt file is changed, it will back up the previous version and send an alert to the Slack API.

To set-up a robots.txt smoke test in Python:

  1. Create hygiene functions to manipulate data
  2. Read and use the credentials
  3. Fetch the robots.txt
  4. Compare to the previous version of the Robots.txt
  5. Send a message to Slack

Requirements

Hygiene Functions

A lot of my projects start with these functions that I use over and over.

  • create_project(): If a project does not exist, create it.
  • get_domain_name(): Converts the URL into a folder name.
  • fetch_page(): Uses the request library to fetch a page.
  • get_date(): Get today’s date.
from datetime import datetime
import os
import requests

from urllib.parse import urlparse

def create_project(directory):
    '''
    Create a project if it does not exist
    '''
    if not os.path.exists(directory):
        print('Create project: '+ directory)
        os.makedirs(directory)
    else:
        print(f'{directory} project exists')

def get_domain_name(start_url):
    '''
    Get Domain Name in the www_domain_com format
    1. Parse URL
    2. Get Domain
    3. Replace dots to make a usable folder path
    '''
    url = urlparse(start_url)               # Parse URL into components
    domain_name = url.netloc                # Get Domain (or network location)
    domain_name = domain_name.replace('.','_')# Replace . by _ to create usable folder
    domain_name = domain_name.split(':')[0]
    return domain_name

def fetch_page(url):
    '''
    Use requests to fetch URL
    '''
    try:
        r = requests.get(url)
    except Exception as e:
        print(e)
    return r

def get_date():
    '''
    Get today's date as YYYY-MM-DD
    '''
    return datetime.today().strftime('%Y-%m-%d:%H:%M%S')

Slack Functions

The get_credentials() function uses the json library to read the credentials file.

The post_to_slack() function adds the message json to be added to the requests header and the credentials.

import json

def get_credentials(credentials):
    '''
    Read credentials from JSON file.
    '''
    with open(credentials, 'r') as f:
        creds = json.load(f)
    return creds['slack_webhook']

def post_to_slack(message,credentials):
    '''
    Create header with the message and post to slack.
    '''
    data = {'text':message}
    url = get_credentials(credentials)
    requests.post(url,json=data, verify=False)

Robots.txt Functions

Next, we will set-up the functions to deal with the robots.txt.

  • get_robots_files(): List all the previous robots.txt files, sorted by date
  • get_robotstxt_name(): Get the latest saved robots.txt.
  • read_robotstxt(): Reads the robots.txt and returns the file.
  • write_robotstxt(): Backups the new robots.txt file.
def get_robots_files(directory):
    '''
    List robots.txt file in output folder
    '''
    output_files = os.listdir(directory)
    output_files.sort()
    return output_files

def get_robotstxt_name(output_files):
    '''
    Get last saved robots.txt file.
    '''
    robots_txt = 'robots.txt'
    files = []
    for filename in output_files:
        if robots_txt in filename:
            files.append(filename)
    return files[-1]

def read_robotstxt(filename):
    '''
    Read robots.txt
    '''
    if os.path.isfile(filename):
        with open(filename,'r') as f:
            txt = f.read()
    return txt

def write_robotstxt(file, output, filename='robots.txt'):
    '''
    Write robots.txt to file using date as identifier.
    '''
    filename = get_date() + '-' + filename
    filename = output + filename
    print(f'filename: {filename}')
    with open(filename,'w') as f:
        f.write(file)

Compare Current VS New Robots.txt

The compare_robots() function checks the latest saved robots.txt file, fetches the robots.txt in your site, and compares whether or not the file is the same.

You Might Also Like  Reddit API with Python (Complete Guide)

If it is new, or has changed, it will write the robots.txt that is currently on your site to a file that contains the date.

Then, it will use the post_to_slack() function to alert you of the change to the robots.txt.

If the file has not changed, the file is not saved, and a simple print function tells you that there were no new change.

def compare_robots(credentials,output,output_files,r):
    '''
    Compare previous robots to actual robots.
    If different. Save File.
    '''
    new_robotstxt = r.text.replace('\r', '')
    if output_files:
        robots_filename = get_robotstxt_name(output_files) 
        filename = output + robots_filename
        last_robotstxt = read_robotstxt(filename)
        last_robotstxt = last_robotstxt.replace('\r', '')
        if new_robotstxt != last_robotstxt:
            print('Robots.txt was modified')
            write_robotstxt(new_robotstxt, output)
            message = f'The Robots.txt was changed for {filename}'
            post_to_slack(message,credentials)
        else:
            print('No Change to Robots.txt')
    else:
        print('No Existing Robots.txt. Saving one.')
        write_robotstxt(r.text, output)

Run the Code

Let’s combine the entire code into a main() function.

site = 'https://www.jcchouinard.com/'

def main(credentials,site,filename='robots.txt'):
    '''
    Combine all functions
    '''
    url = site + filename
    domain = get_domain_name(site)
    output = path + '/output/' + domain + '/'
    create_project(output)
    r = fetch_page(url)
    output_files = get_robots_files(output)
    compare_robots(credentials,output,output_files,r)

To run the function, add the credentials first and then run the function we created above.

if __name__ == '__main__':
    path = os.getcwd()
    credentials = path + '/creds.json'
    main(credentials,site)

The if name equals main line checks whether you are running the module or importing it. If you are importing it, main() will not run.

Full Code

#!/usr/bin/env python
from datetime import datetime
import json
import os
import requests

from urllib.parse import urlparse

site = 'https://www.jcchouinard.com/'

def main(credentials,site,filename='robots.txt'):
    '''
    Combine all functions
    '''
    url = site + filename
    domain = get_domain_name(site)
    output = path + '/output/' + domain + '/'
    create_project(output)
    r = fetch_page(url)
    output_files = get_robots_files(output)
    compare_robots(credentials,output,output_files,r)

def get_credentials(credentials):
    '''
    Read credentials from JSON file.
    '''
    with open(credentials, 'r') as f:
        creds = json.load(f)
    return creds['slack_webhook']

def post_to_slack(message,credentials):
    '''
    Create header with the message and post to slack.
    '''
    data = {'text':message}
    url = get_credentials(credentials)
    requests.post(url,json=data, verify=False)

def create_project(directory):
    '''
    Create a project if it does not exist
    '''
    if not os.path.exists(directory):
        print('Create project: '+ directory)
        os.makedirs(directory)
    else:
        print(f'{directory} project exists')

def get_domain_name(start_url):
    '''
    Get Domain Name in the www_domain_com format
    1. Parse URL
    2. Get Domain
    3. Replace dots to make a usable folder path
    '''
    url = urlparse(start_url)               # Parse URL into components
    domain_name = url.netloc                # Get Domain (or network location)
    domain_name = domain_name.replace('.','_')# Replace . by _ to create usable folder
    domain_name = domain_name.split(':')[0]
    return domain_name

def fetch_page(url):
    '''
    Use requests to fetch URL
    '''
    try:
        r = requests.get(url)
    except Exception as e:
        print(e)
    return r

def get_robots_files(directory):
    '''
    List robots.txt file in output folder
    '''
    output_files = os.listdir(directory)
    output_files.sort()
    return output_files

def get_robotstxt_name(output_files):
    '''
    Get last saved robots.txt file.
    '''
    robots_txt = 'robots.txt'
    files = []
    for filename in output_files:
        if robots_txt in filename:
            files.append(filename)
    return files[-1]

def read_robotstxt(filename):
    '''
    Read robots.txt
    '''
    if os.path.isfile(filename):
        with open(filename,'r') as f:
            txt = f.read()
    return txt

def get_date():
    '''
    Get today's date as YYYY-MM-DD
    '''
    return datetime.today().strftime('%Y-%m-%d:%H:%M%S')

def write_robotstxt(file, output, filename='robots.txt'):
    '''
    Write robots.txt to file using date as identifier.
    '''
    filename = get_date() + '-' + filename
    filename = output + filename
    print(f'filename: {filename}')
    with open(filename,'w') as f:
        f.write(file)

def compare_robots(credentials,output,output_files,r):
    '''
    Compare previous robots to actual robots.
    If different. Save File.
    '''
    new_robotstxt = r.text.replace('\r', '')
    if output_files:
        robots_filename = get_robotstxt_name(output_files) 
        filename = output + robots_filename
        last_robotstxt = read_robotstxt(filename)
        last_robotstxt = last_robotstxt.replace('\r', '')
        if new_robotstxt != last_robotstxt:
            print('Robots.txt was modified')
            write_robotstxt(new_robotstxt, output)
            message = f'The Robots.txt was changed for {filename}'
            post_to_slack(message,credentials)
        else:
            print('No Change to Robots.txt')
    else:
        print('No Existing Robots.txt. Saving one.')
        write_robotstxt(r.text, output)

if __name__ == '__main__':
    path = os.getcwd()
    credentials = path + '/credentials.json'
    main(credentials,site)

Automate The Robots.txt Backups

To automate the robots.txt smoke tests and backups, you can learn how to schedule your python script on Windows task scheduler, or automate python script using CRON on Mac.

You Might Also Like  Slack API with Python (Complete Guide)

You can run your CRON task every day like this:

$ 30 8 * * * /path/to/python /path/to/file/check_robotstxt.py > /path/to/logger/robotstxtchecker.log 2>&1

This is it, you now have implemented a smoke test to check if the robots.txt file of your site has changed and alerted yourself with Python and the Slack API.