An accidental "Disallow: /" can happen to anyone, but it doesn't have to go unnoticed. Whether you're a webmaster, developer, or SEO, this tool helps you quickly discover unwelcome robots.txt changes.
- Easily check results. Robots.txt check results are printed, logged, and optionally emailed (see below).
- Snapshots. The robots.txt content is saved following the first check and whenever the file changes.
- Diffs. A diff file is created after every change to help you view the difference at a glance.
- Email alerts (optional). Automatically notify site watchers about changes and send a summary email to the tool admin after every run. No need to run and check everything manually.
- Designed for reliability. Errors such as a mistyped URL or connection issue are handled gracefully and won't break other robots.txt checks. Unexpected issues are caught and logged.
- Comprehensive logging. Check results and any errors are recorded in a main log and website log (where relevant), so you can refer back to historic data and investigate if anything goes wrong.
- A /data directory is created/accessed to store robots.txt file data and logs.
- All monitored robots.txt files are downloaded and compared against the previous version.
- The check result is logged/printed and categorised as either "first run", "no change", "change", or "error".
- Timestamped snapshots ("first run" and "change") and diffs ("change") are saved in the relevant site directory (a rough sketch of this check-and-diff flow follows this list).
- If enabled, site-specific email alerts are sent out ("first run", "change", and "error").
- If enabled, an administrative email alert is sent out detailing overall check results and any unexpected errors.
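
For illustration, the core check-and-diff step can be pictured roughly as follows. This is a hedged sketch, not the tool's actual code: the function name, return values, and use of `requests` here are assumptions for explanation only.

```python
# Illustrative sketch only — not the tool's actual implementation.
import difflib
import requests

def check_robots(url, old_content):
    """Fetch robots.txt and categorise the result against the last snapshot."""
    try:
        response = requests.get(url + "robots.txt", timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return "error", None  # a bad URL/connection won't break other checks
    new_content = response.text
    if old_content is None:
        return "first run", new_content  # first snapshot gets saved
    if new_content == old_content:
        return "no change", None
    # On a change, a timestamped snapshot and a diff file are saved
    diff = "\n".join(difflib.unified_diff(
        old_content.splitlines(), new_content.splitlines(),
        fromfile="old robots.txt", tofile="new robots.txt", lineterm=""))
    return "change", diff
```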
- Save the project to a new directory (locally or on a server).
- Locate the `example_config.py` file in the `/app/` directory and create a new `config.py` file in the same directory. Copy in the contents of `example_config.py` (this will be filled in properly later; a placeholder sketch follows this list).
- Install all requirements using the Pipfile. For more information, please refer to the Pipenv documentation.
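
For orientation, the settings referenced throughout this README would look something like the following in `config.py`. Use `example_config.py` as the actual template; all values below are placeholders, and the real file may contain other settings.

```python
# config.py — placeholder values for illustration only; copy the real
# template from example_config.py and fill it in for your setup.
EMAILS_ENABLED = False  # set to True to send email alerts
ADMIN_EMAIL = "admin@example.com"  # receives the summary report
SENDER_EMAIL = "sender@gmail.com"  # Gmail address that sends all emails
USER_AGENT = "Mozilla/5.0 (compatible; ...)"  # User-Agent sent when fetching robots.txt
PATH = "/path/to/project/"  # may need editing for cron jobs
```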
The quickest setup, suitable if you plan to run the tool on your local machine for sites that you're personally monitoring.
- Open `config.py` and set `EMAILS_ENABLED = False`.
- Create a CSV file named "monitored_sites.csv" in the same directory as the .py files.
- Add a header row (with column labels) and details of the sites you want to monitor, as defined below (a sketch of how this CSV might be parsed follows this list):
- URL (column 1): the absolute URL of the website homepage, with a trailing slash.
- Name (column 2): the website's name identifier (letters/numbers only).
- Email (column 3): the email address of the site admin, who will receive alerts.
URL | Name | Email
---|---|---
https://github.com/ | Github | [email protected]
- Run `main.py`. The results will be printed and data/logs saved in a newly created /data subdirectory.
- Run `main.py` again whenever you want to re-check the robots.txt files. It's recommended that you check the print output or main log after every run, or at least after new sites are added, in case of unexpected errors.
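
To clarify the expected layout, here's a hypothetical snippet that reads a three-column monitored_sites.csv like the one above; the tool's own parsing code may differ.

```python
# Hypothetical illustration of the monitored_sites.csv layout;
# not necessarily how the tool itself reads the file.
import csv

with open("monitored_sites.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (column labels)
    for url, name, email in reader:
        print(f"Monitoring {name}: {url}robots.txt (alerts go to {email})")
```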
Slightly more setup, suitable if you plan to run the tool on your local machine for yourself and others.
- Add the required details to `config.py`:
  - Set `ADMIN_EMAIL` to the email address that will receive the summary report.
  - Set `SENDER_EMAIL` to the Gmail address that will send all emails. Less secure app access must be enabled. It's strongly recommended that you set up a new Google account for this.
  - Ensure `EMAILS_ENABLED = True`.
- Create a CSV file named "monitored_sites.csv" in the same directory as the .py files.
- Add a header row (with column labels) and details of sites you want to monitor, as defined below:
- URL (column 1): the absolute URL of the website homepage, with a trailing slash.
- Name (column 2): the website's name identifier (letters/numbers only).
- Email (column 3): the email address of the site admin, who will receive alerts.
URL | Name | Email
---|---|---
https://github.com/ | Github | [email protected]
- Open `main.py`, uncomment `emails.set_email_login()`, and save the file:

```python
if __name__ == "__main__":
    # Use set_email_login() to save login details on first run or if email/password changes:
    emails.set_email_login()
    main()
```
- Run `main.py`. The results will be printed and data/logs saved in a newly created /data subdirectory. You will be prompted to enter your `SENDER_EMAIL` password, which will be saved for future use via Keyring (see the sketch after this list).
- Re-comment `emails.set_email_login()` and save the file:

```python
if __name__ == "__main__":
    # Use set_email_login() to save login details on first run or if email/password changes:
    # emails.set_email_login()
    main()
```

- Run `main.py` again whenever you want to re-check the robots.txt files. It's recommended that you check the main log after every run, or at least after new sites are added, in case of unexpected errors.
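
For background, the prompt-and-store step relies on Keyring. Conceptually, `set_email_login()` does something like the following; this is a sketch under assumptions, not the tool's exact code. (The `"yagmail"` service name matches how the yagmail library itself registers credentials, but treat it as an assumption here.)

```python
# Conceptual sketch of Keyring-backed credential storage — assumptions only.
import getpass
import keyring

def set_email_login(sender_email):
    password = getpass.getpass(f"Password for {sender_email}: ")
    keyring.set_password("yagmail", sender_email, password)

# Later runs can then retrieve the password without prompting:
# keyring.get_password("yagmail", sender_email)
```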
More setup for a fully automated experience.
Refer to "Emails enabled, local", with the following considerations:
- If Keyring isn't compatible with your server:
  - Open `emails.py` and locate the following line within `send_emails()`: `with yagmail.SMTP(config.SENDER_EMAIL) as server:`
  - Edit this line to include the sender email password as the second argument of `yagmail.SMTP`: `with yagmail.SMTP(config.SENDER_EMAIL, SENDER_EMAIL_PASSWORD) as server:` (a sketch of this variant follows this list).
  - Open `main.py` and ensure `emails.set_email_login()` is commented out.
- You may need to edit the shebang line at the top of `main.py`.
- You may need to edit the `PATH` variable in `config.py`.
- It's strongly recommended that you test that the cron job implementation is working correctly. To test that changes are being correctly detected/reported, you can edit `new_file.txt` within the `program_files` subdirectory of a monitored site directory. On the following run, a change (versus the edited, "old" file) should be reported.
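
Returning to the first consideration above: passing the password directly means yagmail no longer looks it up via Keyring. One way to supply `SENDER_EMAIL_PASSWORD` on a server is an environment variable; note that this pattern is a suggestion, not something the tool does out of the box.

```python
# Sketch of the server-friendly send_emails() variant described above.
# Sourcing the password from an environment variable is an assumption,
# not part of the tool's stock behaviour.
import os
import yagmail

import config  # the tool's config module

SENDER_EMAIL_PASSWORD = os.environ["SENDER_EMAIL_PASSWORD"]

with yagmail.SMTP(config.SENDER_EMAIL, SENDER_EMAIL_PASSWORD) as server:
    server.send(
        to=config.ADMIN_EMAIL,
        subject="Robots.txt check summary",
        contents="Example body",
    )
```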
Feel free to create an issue or open a new discussion.
If the tool repeatedly reports an error or inaccurate robots.txt content for a site, but you're able to view the file via your browser, this is likely due to an invalid URL format or the tool being blocked in some way.
Try the following:
- Check that the monitored URL is in the correct format.
- Request that your IP address (or your server's IP address) be whitelisted.
- Try adjusting the `USER_AGENT` string (e.g. to spoof a common browser) in `config.py` (a standalone test snippet follows this list).
- Troubleshoot with your IT/development teams.
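
If you suspect blocking, you can test how a site responds to a given User-Agent string before changing `config.py`. This standalone snippet assumes the `requests` library and is independent of the tool's own fetching code.

```python
# Standalone User-Agent test — independent of the tool itself.
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # example browser-like string
response = requests.get(
    "https://github.com/robots.txt",
    headers={"User-Agent": user_agent},
    timeout=30,
)
print(response.status_code)
print(response.text[:500])
```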
There are no further updates/features planned, and I'm not looking for contributions, but I'll be happy to fix any (significant) bugs.