Data from the Scrutiny export can be converted into a more usable format by running the included script:

```
$ cd search_gov_crawler/search_gov_spiders/utility_files
$ python import_plist.py --input_file ./scrutiny-2023-06-20.plist
```
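For context, the conversion is essentially a plist-to-JSON translation. Below is a minimal sketch of that idea using only the standard library; the output file name and structure are assumptions for illustration, not the script's actual behavior — see `import_plist.py` for the real field mapping.

```python
# Minimal sketch of a plist-to-JSON conversion, assuming import_plist.py
# follows this general shape. Output file name and structure are
# illustrative; see the actual script for details.
import json
import plistlib
from pathlib import Path

def convert_plist(input_file: str, output_file: str = "crawl-sites.json") -> None:
    """Parse a Scrutiny .plist export and write it out as JSON."""
    with open(input_file, "rb") as f:
        data = plistlib.load(f)  # handles both XML and binary plists
    Path(output_file).write_text(json.dumps(data, indent=2, default=str))

if __name__ == "__main__":
    convert_plist("./scrutiny-2023-06-20.plist")
```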
To start, I have spread jobs throughout the day. I did not give any consideration to how long individual jobs run, so this may need to be adjusted to allow for very long-running jobs. All times are UTC. A maintenance window has been established each Wednesday between 1500 and 2100 so we can do releases without extra enabling/disabling of jobs.
✔️ - Scrape job scheduled  ❌ - Maintenance window, do not schedule (Wed 1500-2100)
UTC | Mon | Tue | Wed | Thu | Fri |
---|---|---|---|---|---|
0030 | | | | | |
0100 | | | | | |
0130 | | | | | |
0200 | | | | | |
0230 | | | | | |
0300 | | | | | |
0330 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
0400 | | | | | |
0430 | | | | | |
0500 | | | | | |
0530 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
0600 | | | | | |
0630 | | | | | |
0700 | | | | | |
0730 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
0800 | | | | | |
0830 | | | | | |
0900 | | | | | |
0930 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
1000 | | | | | |
1030 | | | | | |
1100 | | | | | |
1130 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
1200 | | | | | |
1230 | | | | | |
1300 | | | | | |
1330 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
1400 | | | | | |
1430 | | | | | |
1500 | | | ❌ | | |
1530 | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
1600 | | | ❌ | | |
1630 | | | ❌ | | |
1700 | | | ❌ | | |
1730 | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
1800 | | | ❌ | | |
1830 | | | ❌ | | |
1900 | | | ❌ | | |
1930 | ✔️ | ✔️ | ❌ | ✔️ | ✔️ |
2000 | | | ❌ | | |
2030 | | | ❌ | | |
2100 | | | | | |
2130 | ✔️ | ✔️ | ✔️ | ✔️ | |
2200 | | | | | |
2230 | | | | | |
2300 | | | | | |
2330 | | | | | |
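For reference, the cadence in this table maps onto cron-style triggers. Below is a minimal sketch using APScheduler (the scheduler library scrapydweb's Timer Tasks are built on); the job bodies and function names are placeholders, and the real tasks are created by `init_schedule.py` as described below.

```python
# Sketch of the recurrence in the schedule table, expressed as
# APScheduler cron triggers. Job bodies are placeholders; the real
# tasks are created in scrapydweb by init_schedule.py.
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler(timezone="UTC")

# Slots that run every weekday: 0330, 0530, 0730, 0930, 1130, 1330 UTC.
@scheduler.scheduled_job("cron", day_of_week="mon-fri", hour="3,5,7,9,11,13", minute=30)
def weekday_slots():
    print("launch a scrapy crawl here")

# Afternoon slots skip Wednesday's 1500-2100 maintenance window.
@scheduler.scheduled_job("cron", day_of_week="mon,tue,thu,fri", hour="15,17,19", minute=30)
def afternoon_slots():
    print("launch a scrapy crawl here")

# The 2130 UTC slot runs Mon-Thu per the table.
@scheduler.scheduled_job("cron", day_of_week="mon-thu", hour=21, minute=30)
def evening_slot():
    print("launch a scrapy crawl here")

scheduler.start()
```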
To initialize the above schedule in scrapydweb, follow these instructions:

- Run scrapyd, logparser, and scrapydweb using the directions in the main README file.
- Apply the schedule to the database by running the init script:

  ```
  $ cd search_gov_crawler/search_gov_spiders/utility_files
  $ python init_schedule.py --input_file=./crawl-sites.json
  ```
- If there are any issues finding the database files, you may need to set the `DATA_PATH` environment variable (for example, `export DATA_PATH=/path/to/data` before starting scrapydweb; the path here is illustrative). If you have not set `DATA_PATH` as an environment variable or in the scrapydweb config, it defaults to `<venv-path>/lib/python3.12/site-packages/scrapydweb/data/database`.
- Refresh the scrapydweb Timer Tasks page.
- All tasks will initially be in an inactive state. To reactivate:
  - Click the `Edit` button on any task.
  - Update the `action` field from `Add Task & Fire Right Now` to `Add Task`.
  - Update the `name` field to remove the `- edit` suffix that has been added to the end of the task name.
  - Click the `Check CMD` button.
  - Click the `[update] Add Task` button.
  - Verify the task has a value for `Next run time` on the Timer Tasks page.
  - Click
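Once tasks are active, one way to confirm crawls actually fire at the scheduled times is to poll scrapyd's standard `listjobs.json` endpoint. This is a convenience sketch, not part of the project tooling; the host, port, and project name below are assumptions — adjust them to match your deployment.

```python
# Poll scrapyd's listjobs.json endpoint to confirm scheduled crawls are
# firing. Host/port and project name are assumptions for illustration.
import requests

resp = requests.get(
    "http://localhost:6800/listjobs.json",
    params={"project": "search_gov_spiders"},
    timeout=10,
)
resp.raise_for_status()
jobs = resp.json()
print("pending:", [job["spider"] for job in jobs.get("pending", [])])
print("running:", [job["spider"] for job in jobs.get("running", [])])
print("finished:", [job["spider"] for job in jobs.get("finished", [])][-5:])
```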