
searchgov-spider

The home for the spider that supports Search.gov.

About

With the move away from using Bing to provide search results for some domains, we need a solution that can index sites that were previously indexed by Bing and/or that do not have standard sitemaps. Additionally, the Scrutiny desktop application is run manually to provide coverage for a few dozen domains that cannot otherwise be indexed. The spider application is our solution to both problems: it replaces the Bing-based indexing and removes those manual steps. The documentation here represents the most current state of the application and our design.

Technologies

We currently run Python 3.12. The spider is built on the open-source Scrapy framework, along with several other open-source libraries and Scrapy plugins. See our requirements file for more details.

Core Scrapy File Structure

*Note: The repository contains other files and directories, but the folders and files below are the ones needed for the Scrapy framework.

├── search_gov_crawler              # scrapy root
│   ├── elasticsearch               # code related to indexing content in elasticsearch
│   ├── search_gov_spider           # scrapy project dir
│   │   ├── extensions              # custom scrapy extensions
│   │   ├── helpers                 # common functions
│   │   ├── spiders                 # all search_gov_spider spiders
│   │   │   ├── domain_spider.py    # for html pages
│   │   │   ├── domain_spider_js.py # for js pages
│   │   ├── utility_files           # json files with default domains to scrape
│   │   ├── items.py                # defines individual output of scrapes
│   │   ├── middlewares.py          # custom middleware code
│   │   ├── monitors.py             # custom spidermon monitors
│   │   ├── pipelines.py            # custom item pipelines
│   │   ├── settings.py             # settings that control all scrapy jobs
│   ├── scrapy.cfg
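
To illustrate how these pieces fit together, here is a minimal, hypothetical sketch of an item and a pipeline in the spirit of items.py and pipelines.py. The class and field names (SearchGovSpiderItem, UrlValidationPipeline, url) are assumptions for illustration, not the project's actual definitions.

# Hypothetical sketch only: class and field names are illustrative.
import scrapy
from scrapy.exceptions import DropItem


class SearchGovSpiderItem(scrapy.Item):
    # items.py defines the individual output of a scrape; a single URL
    # field is assumed here because the crawl output is a list of URLs.
    url = scrapy.Field()


class UrlValidationPipeline:
    # pipelines.py holds custom item pipelines such as this one.
    def process_item(self, item, spider):
        if not item.get("url"):
            raise DropItem("missing url")  # discard items without a URL
        return item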

Quick Start

  1. Install and activate a virtual environment:
python -m venv venv
. venv/bin/activate
  2. Install the required Python modules:
pip install -r requirements.txt

# required for domains that need javascript
playwright install --with-deps
playwright install chrome --force
  3. Run a spider (a programmatic alternative is sketched after this list):
# to run for a non-js domain:
scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com -a output_target=csv

# or to run for a js domain
scrapy crawl domain_spider_js -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/js -a output_target=csv
  4. Check the output:

The output of this scrape is one or more CSV files containing URLs, written to the output directory.

  5. Learn more:

For more advanced usage, see the Advanced Setup and Use page.
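
The scrapy crawl commands in step 3 can also be invoked from Python. Below is a minimal sketch, assuming it is run from the scrapy root (the directory containing scrapy.cfg) so the project settings and spiders can be found; the keyword arguments correspond to the -a options on the command line.

# Minimal sketch of running the non-js crawl from Python instead of the CLI.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# Keyword arguments here mirror the -a options passed to scrapy crawl.
process.crawl(
    "domain_spider",
    allowed_domains="quotes.toscrape.com",
    start_urls="https://quotes.toscrape.com",
    output_target="csv",
)
process.start()  # blocks until the crawl finishes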
