Skip to content

Latest commit

 

History

History
executable file
·
79 lines (72 loc) · 3.97 KB

File metadata and controls

executable file
·
79 lines (72 loc) · 3.97 KB

Scrutiny to JSON Import

Data from the Scrutiny export can be converted into a more usable format by running the included script:

$ cd search_gov_crawler/search_gov_spiders/utility_files
$ python import_plist.py --input_file ./scrutiny-2023-06-20.plist

Job Schedule Calendar

To start I have spread jobs throughout the day. I did not give any consideration to how long individual jobs run so this may need to be adjusted to allow for very long running jobs. All times are UTC. A maintenance window has been established each Wednesday between 1500 and 2100 so we can do releases without extra enabling/disabling of jobs.

✔️ - Scrape job scheduled ❌ - Maintenance window, do not schedule (Wed 1530-

UTC Mon Tue Wed Thu Fri
0030
0100
0130
0200
0230
0300
0330 ✔️ ✔️ ✔️ ✔️ ✔️
0400
0430
0500
0530 ✔️ ✔️ ✔️ ✔️ ✔️
0600
0630
0700
0730 ✔️ ✔️ ✔️ ✔️ ✔️
0800
0830
0900
0930 ✔️ ✔️ ✔️ ✔️ ✔️
1000
1030
1100
1130 ✔️ ✔️ ✔️ ✔️ ✔️
1200
1230
1300
1330 ✔️ ✔️ ✔️ ✔️ ✔️
1400
1430
1500
1530 ✔️ ✔️ ✔️ ✔️
1600
1630
1700
1730 ✔️ ✔️ ✔️ ✔️
1800
1830
1900
1930 ✔️ ✔️ ✔️ ✔️
2000
2030
2100
2130 ✔️ ✔️ ✔️ ✔️
2200
2230
2300
2330

Schedule Import

To initialize the above schedule in scrapydweb follow these instructions:

  • Run scrapyd, logparser, and scrapydweb using directions in main README file.
  • Apply schedule to database by running the init script $ cd search_gov_crawler/search_gov_spiders/utility_files $ python init_schedule.py --input_file=./crawl-sites.json
  • If there are any issues finding the database files, you may need to set the DATA_PATH environment variable. If you have not set DATA_PATH as an environment variable or in scrapydweb config, it will be set to <venv-path>/lib/python3.12/site-packages/scrapydweb/data/database
  • Refresh scrapydweb Timer Tasks page
  • All tasks will initially be in an inactive state. To reactivate:
    • Click Edit button on any task
    • Update action field from Add Task & Fire Right Now to Add Task
    • Update name field to remove - edit that has been added to end of task name
    • Click Check CMD button
    • Click [update] Add Task button
    • Verify task has a value for Next run time on the Timer Task page.