The home for the spider that supports search.gov.
The spider uses the open source scrapy framework.
The spiders can be found at search_gov_crawler/search_gov_spiders/spiders/.

*Note: Other files and directories exist within the repository, but the folders and files below are the ones needed for the scrapy framework.
├── search_gov_crawler ( scrapy root )
│   ├── search_gov_spiders ( scrapy project *Note: multiple projects can exist within a scrapy root )
│   │   ├── extensions ( includes custom scrapy extensions )
│   │   ├── helpers ( includes common functions )
│   │   ├── spiders
│   │   │   ├── domain_spider.py ( spider for html pages )
│   │   │   ├── domain_spider_js.py ( spider for js pages )
│   │   ├── utility_files ( includes json files with default domains to scrape )
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   ├── scrapy.cfg
The spider can either scrape for URLs from the list of required domains or take in a domain and starting URL to scrape a site/domain.
Running the spider produces a list of URLs found in search_gov_crawler/search_gov_spiders/spiders/scrapy_urls/{spider_name}/{spider_name}_{date}-{UTC_time}.txt, as specified by FEEDS in settings.py.
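For orientation, a Scrapy FEEDS setting generally has the shape sketched below. This is illustrative only; the actual path template, format, and options live in the repository's settings.py.

```python
# Illustrative sketch only -- the authoritative configuration is in
# search_gov_crawler/search_gov_spiders/settings.py and may differ.
# Scrapy expands %(name)s (spider name) and %(time)s (UTC timestamp) per run.
FEEDS = {
    "spiders/scrapy_urls/%(name)s/%(name)s_%(time)s.txt": {
        "format": "csv",      # exporter format; the file extension is independent of it
        "overwrite": False,   # do not overwrite an existing file with the same name
    },
}
```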
Make sure to run pip install -r requirements.txt and playwright install before running any spiders.

Navigate down to search_gov_crawler/search_gov_spiders, then enter the command below:
scrapy crawl domain_spider
to run for all URLs/domains that do not require JavaScript handling. To run for all sites that require JavaScript, run:
scrapy crawl domain_spider_js
Note: these crawls will take a long time.
In the same directory specified above, enter one of the commands below, adding the domain and starting URL for the crawler:
scrapy crawl domain_spider -a allowed_domains=example.com -a start_urls=www.example.com
or, for a site that requires JavaScript handling:
scrapy crawl domain_spider_js -a allowed_domains=example.com -a start_urls=www.example.com
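For illustration only (this is not part of the repository's documented workflow), the same single-site crawl can be started from Python with Scrapy's CrawlerProcess, assuming the script is run from inside the Scrapy project so the spider name and settings resolve:

```python
# A minimal sketch, not the project's own tooling: start the crawl from Python
# instead of the scrapy CLI. Run from within the Scrapy project so that
# get_project_settings() picks up settings.py and the spider name resolves.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    "domain_spider",                  # spider name, as with `scrapy crawl`
    allowed_domains="example.com",    # same values passed with -a on the CLI
    start_urls="www.example.com",
)
process.start()  # blocks until the crawl finishes
```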
- Navigate to the spiders directory
- Enter one of two following commands:
  - This command will output the yielded URLs in the destination (relative to the spiders directory) and file format specified in search_gov_crawler/search_gov_spiders/pipelines.py (see the pipeline sketch after this list):
    $ scrapy runspider <spider_file.py>
  - This command will output the yielded URLs in the destination (relative to the spiders directory) and file format specified by the user:
    $ scrapy runspider <spider_file.py> -o <filepath_to_output_folder/spider_output_filename.csv>
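For background on the first option: a Scrapy item pipeline of roughly the following shape is what writes yielded URLs out to a file. This is a sketch under assumed names; the repository's pipelines.py is the source of truth for the real destination and format.

```python
# Rough sketch only -- see search_gov_crawler/search_gov_spiders/pipelines.py
# for the real implementation. Class and file names here are hypothetical.
class UrlExportPipeline:
    """Write each yielded URL to a plain-text file, one URL per line."""

    def open_spider(self, spider):
        # Hypothetical output path, relative to wherever the spider is run.
        self.file = open(f"{spider.name}_urls.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(item["url"] + "\n")
        return item
```

A pipeline like this is enabled through the ITEM_PIPELINES setting in settings.py.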
- First, install Scrapyd and scrapyd-client (a library that helps eggify and deploy the Scrapy project to the Scrapyd server):
  $ pip install scrapyd
  $ pip install git+https://github.com/scrapy/scrapyd-client.git
- Next, navigate to the scrapyd_files directory and start the server:
  $ scrapyd
  Note: the directory where you start the server is arbitrary. It is simply where the logs and the Scrapy project FEED destination (relative to the server directory) will be located.
- Navigate to the Scrapy project root directory and run this command to eggify the Scrapy project and deploy it to the Scrapyd server:
  $ scrapyd-deploy default
  Note: this will simply deploy to a local Scrapyd server. To add custom deployment endpoints, edit the scrapy.cfg file and add or customize endpoints. For instance, if you wanted local and production endpoints:

  [settings]
  default = search_gov_spiders.settings

  [deploy: local]
  url = http://localhost:6800/
  project = search_gov_spiders

  [deploy: production]
  url = <IP_ADDRESS>
  project = search_gov_spiders
  To deploy:

  # deploy locally
  scrapyd-deploy local

  # deploy production
  scrapyd-deploy production
For an interface to view jobs (pending, running, finished) and logs, access http://localhost:6800/. However, to actually manipulate the spiders deployed to the Scrapyd server, you'll need to use the Scrapyd JSON API. Some of the most-used commands:
- Schedule a job:
  $ curl http://localhost:6800/schedule.json -d project=search_gov_spiders -d spider=<spider_name>
- Check load status of a service:
  $ curl http://localhost:6800/daemonstatus.json
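The same API calls can be scripted from Python; below is a small sketch using the requests library (assumed to be installed) against the local server above.

```python
# Sketch: call the Scrapyd JSON API from Python instead of curl.
# Assumes a Scrapyd server on localhost:6800 with the project deployed.
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a job (equivalent to the curl command above).
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "search_gov_spiders", "spider": "domain_spider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# Check the load status of the service.
print(requests.get(f"{SCRAPYD}/daemonstatus.json").json())
```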
- Navigate to anywhere within the Scrapy project root directory and run this command:
  $ scrapy genspider -t crawl <spider_name> "<spider_starting_domain>"
- Open the /search_gov_spiders/search_gov_spiders/spiders/boilerplate.py file and replace the lines of the generated spider with the lines of the boilerplate spider as dictated in the boilerplate file.
- Modify the rules in the new spider as needed (a rough sketch follows this list). Here's the Scrapy rules documentation for the specifics.
- To update the Scrapyd server with the new spider, run:
  $ scrapyd-deploy <default or endpoint_name>
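As a rough illustration of the rules step above (a sketch with placeholder names, not the repository's boilerplate), a CrawlSpider's rules control which links are followed and which callback handles each response:

```python
# Placeholder sketch -- copy the real structure from the repo's boilerplate.py.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):          # hypothetical spider
    name = "example_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        # Follow every in-domain link except a few non-HTML assets,
        # sending each fetched page to parse_item.
        Rule(
            LinkExtractor(deny=(r"\.pdf$", r"\.zip$")),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```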
## Running Against All Listed Search.gov Domains
This process allows scrapy to be run directly using an in-memory scheduler. The schedule is based on the initial schedule setup in the utility files readme. The process will run until killed.
- Source virtual environment and update dependencies.
- Start scheduler:
  $ python search_gov_crawler/scrapy_scheduler.py
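As a rough sketch of the idea only (the actual implementation is search_gov_crawler/scrapy_scheduler.py, which may work differently), an in-memory scheduler can periodically launch crawls, for example with APScheduler, assuming that library is available:

```python
# Illustration of an in-memory scheduler, not the repository's implementation.
# Assumes APScheduler is installed; see search_gov_crawler/scrapy_scheduler.py
# for the real scheduling logic and schedule source.
import subprocess
from apscheduler.schedulers.blocking import BlockingScheduler

def run_spider(spider_name: str) -> None:
    # Launch each crawl in its own process so Twisted's reactor starts cleanly.
    subprocess.run(["scrapy", "crawl", spider_name], check=False)

scheduler = BlockingScheduler()
# Hypothetical schedule: crawl the non-JS spider once a day.
scheduler.add_job(run_spider, "interval", days=1, args=["domain_spider"])
scheduler.start()  # runs until the process is killed
```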
Alternatively, to run using scrapyd and scrapydweb:
- Source virtual environment, update dependencies, and change working directory to search_gov_crawler
- Start scrapyd:
  $ scrapyd
- Build latest version of the scrapy project (if any changes have been made since last run):
  $ scrapyd-deploy local -p search_gov_spiders
- Start logparser:
  $ python -m search_gov_logparser
- Start scrapydweb:
  $ python -m search_gov_scrapydweb