The home for the spider that supports search.gov.
The spider uses the open source scrapy framework.
The spiders can be found at search_gov_crawler/search_gov_spiders/spiders/.

*Note: Other files and directories exist within the repository, but the folders and files below are the ones needed for the scrapy framework.
├── search_gov_crawler ( scrapy root )
│   ├── search_gov_spiders ( scrapy project *Note: multiple projects can exist within a scrapy root )
│   │   ├── extensions ( includes custom scrapy extensions )
│   │   ├── helpers ( includes common functions )
│   │   ├── spiders
│   │   │   ├── domain_spider.py ( spider for html pages )
│   │   │   ├── domain_spider_js.py ( spider for js pages )
│   │   ├── utility_files ( includes json files with default domains to scrape )
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   ├── scrapy.cfg
The spider can either scrape for URLs from the list of required domains or take in a domain and starting URL to scrape a site/domain.
Running the spider produces a list of URLs found in search_gov_crawler/search_gov_spiders/spiders/scrapy_urls/{spider_name}/{spider_name}_{date}-{UTC_time}.txt, as specified by FEEDS in settings.py.
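For orientation, a Scrapy FEEDS setting generally has the shape sketched below. This is illustrative only; the actual path template, format, and options live in the repository's settings.py.

```python
# Illustrative sketch only -- the authoritative configuration is in
# search_gov_crawler/search_gov_spiders/settings.py and may differ.
# Scrapy expands %(name)s (spider name) and %(time)s (UTC timestamp) per run.
FEEDS = {
    "spiders/scrapy_urls/%(name)s/%(name)s_%(time)s.txt": {
        "format": "csv",      # exporter format; the file extension is independent of it
        "overwrite": False,   # do not overwrite an existing file with the same name
    },
}
```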
Make sure to run pip install -r requirements.txt and playwright install before running any spiders.

Navigate down to search_gov_crawler/search_gov_spiders, then enter the command below:
scrapy crawl domain_spider
to run for all URLs/domains that do not require JavaScript handling. To run for all sites that require JavaScript, run:
scrapy crawl domain_spider_js
Note: these crawls will take a long time.
In the same directory specified above, enter one of the commands below, adding the domain and starting URL for the crawler:
scrapy crawl domain_spider -a allowed_domains=example.com -a start_urls=www.example.com
or, for a site that requires JavaScript handling:
scrapy crawl domain_spider_js -a allowed_domains=example.com -a start_urls=www.example.com
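For illustration only (this is not part of the repository's documented workflow), the same single-site crawl can be started from Python with Scrapy's CrawlerProcess, assuming the script is run from inside the Scrapy project so the spider name and settings resolve:

```python
# A minimal sketch, not the project's own tooling: start the crawl from Python
# instead of the scrapy CLI. Run from within the Scrapy project so that
# get_project_settings() picks up settings.py and the spider name resolves.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    "domain_spider",                  # spider name, as with `scrapy crawl`
    allowed_domains="example.com",    # same values passed with -a on the CLI
    start_urls="www.example.com",
)
process.start()  # blocks until the crawl finishes
```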
- Navigate to the spiders directory
- Enter one of two following commands:
  - This command will output the yielded URLs in the destination (relative to the spiders directory) and file format specified in search_gov_crawler/search_gov_spiders/pipelines.py (see the pipeline sketch after this list):
    $ scrapy runspider <spider_file.py>
  - This command will output the yielded URLs in the destination (relative to the spiders directory) and file format specified by the user:
    $ scrapy runspider <spider_file.py> -o <filepath_to_output_folder/spider_output_filename.csv>
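For background on the first option: a Scrapy item pipeline of roughly the following shape is what writes yielded URLs out to a file. This is a sketch under assumed names; the repository's pipelines.py is the source of truth for the real destination and format.

```python
# Rough sketch only -- see search_gov_crawler/search_gov_spiders/pipelines.py
# for the real implementation. Class and file names here are hypothetical.
class UrlExportPipeline:
    """Write each yielded URL to a plain-text file, one URL per line."""

    def open_spider(self, spider):
        # Hypothetical output path, relative to wherever the spider is run.
        self.file = open(f"{spider.name}_urls.txt", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(item["url"] + "\n")
        return item
```

A pipeline like this is enabled through the ITEM_PIPELINES setting in settings.py.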
- First, install Scrapyd and scrapyd-client (a library that helps eggify and deploy the Scrapy project to the Scrapyd server):
  $ pip install scrapyd
  $ pip install git+https://github.com/scrapy/scrapyd-client.git
- Next, navigate to the scrapyd_files directory and start the server:
  $ scrapyd
  Note: the directory where you start the server is arbitrary. It is simply where the logs and the Scrapy project FEED destination (relative to the server directory) will be located.
- Navigate to the Scrapy project root directory and run this command to eggify the Scrapy project and deploy it to the Scrapyd server:
  $ scrapyd-deploy default
  Note: this will simply deploy to a local Scrapyd server. To add custom deployment endpoints, edit the scrapy.cfg file and add or customize endpoints. For instance, if you wanted local and production endpoints:

  [settings]
  default = search_gov_spiders.settings

  [deploy: local]
  url = http://localhost:6800/
  project = search_gov_spiders

  [deploy: production]
  url = <IP_ADDRESS>
  project = search_gov_spiders
  To deploy:

  # deploy locally
  scrapyd-deploy local

  # deploy production
  scrapyd-deploy production
For an interface to view jobs (pending, running, finished) and logs, access http://localhost:6800/. However, to actually manipulate the spiders deployed to the Scrapyd server, you'll need to use the Scrapyd JSON API. Some of the most-used commands:
- Schedule a job:
  $ curl http://localhost:6800/schedule.json -d project=search_gov_spiders -d spider=<spider_name>
- Check load status of a service:
  $ curl http://localhost:6800/daemonstatus.json
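The same API calls can be scripted from Python; below is a small sketch using the requests library (assumed to be installed) against the local server above.

```python
# Sketch: call the Scrapyd JSON API from Python instead of curl.
# Assumes a Scrapyd server on localhost:6800 with the project deployed.
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a job (equivalent to the curl command above).
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "search_gov_spiders", "spider": "domain_spider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# Check the load status of the service.
print(requests.get(f"{SCRAPYD}/daemonstatus.json").json())
```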
- Navigate to anywhere within the Scrapy project root directory and run this command:
  $ scrapy genspider -t crawl <spider_name> "<spider_starting_domain>"
- Open the /search_gov_spiders/search_gov_spiders/spiders/boilerplate.py file and replace the lines of the generated spider with the lines of the boilerplate spider as dictated in the boilerplate file.
- Modify the rules in the new spider as needed (a rough sketch follows this list). Here's the Scrapy rules documentation for the specifics.
- To update the Scrapyd server with the new spider, run:
  $ scrapyd-deploy <default or endpoint_name>
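As a rough illustration of the rules step above (a sketch with placeholder names, not the repository's boilerplate), a CrawlSpider's rules control which links are followed and which callback handles each response:

```python
# Placeholder sketch -- copy the real structure from the repo's boilerplate.py.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):          # hypothetical spider
    name = "example_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        # Follow every in-domain link except a few non-HTML assets,
        # sending each fetched page to parse_item.
        Rule(
            LinkExtractor(deny=(r"\.pdf$", r"\.zip$")),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```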
## Running Against All Listed Search.gov Domains
This process allows scrapy to be run directly using an in-memory scheduler. The schedule is based on the initial schedule setup in the utility files readme. The process will run until killed.
- Source virtual environment and update dependencies.
- Start scheduler:
  $ python search_gov_crawler/scrapy_scheduler.py
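As a rough sketch of the idea only (the actual implementation is search_gov_crawler/scrapy_scheduler.py, which may work differently), an in-memory scheduler can periodically launch crawls, for example with APScheduler, assuming that library is available:

```python
# Illustration of an in-memory scheduler, not the repository's implementation.
# Assumes APScheduler is installed; see search_gov_crawler/scrapy_scheduler.py
# for the real scheduling logic and schedule source.
import subprocess
from apscheduler.schedulers.blocking import BlockingScheduler

def run_spider(spider_name: str) -> None:
    # Launch each crawl in its own process so Twisted's reactor starts cleanly.
    subprocess.run(["scrapy", "crawl", spider_name], check=False)

scheduler = BlockingScheduler()
# Hypothetical schedule: crawl the non-JS spider once a day.
scheduler.add_job(run_spider, "interval", days=1, args=["domain_spider"])
scheduler.start()  # runs until the process is killed
```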
Alternatively, to run using scrapyd and scrapydweb:
- Source virtual environment, update dependencies, and change working directory to search_gov_crawler
- Start scrapyd:
  $ scrapyd
- Build latest version of the scrapy project (if any changes have been made since last run):
  $ scrapyd-deploy local -p search_gov_spiders
- Start logparser:
  $ python -m search_gov_logparser
- Start scrapydweb:
  $ python -m search_gov_scrapydweb