Careers Page Scraper

Used for:

Job alerts for companies that don't offer a way to sign up for job alerts. This requires knowing the specific companies you want to follow, and reusing or writing scrapers for their careers pages.
- See common_scrapers.py for the supported types of careers pages.
Scraping Crunchbase pages to collect, de-duplicate, and discover companies.
- Crunchbase has a lot of scrape protections, so this scraper needs to be run locally (manually triggered, rather than emailing you when there's something new). It should also be run behind a VPN service, unless you want to risk your home IP getting blocked.
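
To make the "de-duplicate" step concrete, here is a minimal sketch of one way to collapse company names that differ only in case, punctuation, or a legal suffix. It is an illustration only, not the logic this repository uses: the helper names `normalize_name` and `dedupe_companies` and the suffix list are assumptions.

```python
import re


def normalize_name(name):
    """Build a comparison key: lowercase, no punctuation, no trailing legal suffix.

    The suffix list is a hypothetical choice for this example.
    """
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return re.sub(r" (inc|llc|ltd|corp)$", "", cleaned)


def dedupe_companies(names):
    """Keep the first spelling seen for each normalized company name."""
    seen = set()
    unique = []
    for name in names:
        key = normalize_name(name)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


print(dedupe_companies(["Acme, Inc.", "acme inc", "Globex"]))
# ['Acme, Inc.', 'Globex']
```

Keeping the first spelling seen preserves the order in which companies were collected.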

If you're a non-software-engineer friend who's interested in this and might be down for some coding-lite, talk to me 🙂. (Or just talk to me anyway, cos yay friends!)

Can be run as a local Python script that outputs new jobs (relative to the last run) matching your search terms, or hosted on AWS to run on a schedule (e.g. daily) and send an email notification for new jobs. Only new scrape errors are printed/emailed: a company that also errored in the last run will not trigger another notification.
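
The "new relative to the last run" behaviour can be pictured as a set difference against the previous run's record. The sketch below is a minimal illustration, not the code in job_scrape.py: the record layout (company name mapped to a list of job titles, plus a list of companies that errored) and the function name `diff_against_last_run` are assumptions made for the example.

```python
def diff_against_last_run(previous, current):
    """Return (new_jobs, new_errors) relative to the previous run.

    Both arguments use a hypothetical layout:
        {"jobs": {company: [job title, ...]}, "errors": [company, ...]}
    """
    new_jobs = {}
    for company, titles in current["jobs"].items():
        seen = set(previous["jobs"].get(company, []))
        fresh = [title for title in titles if title not in seen]
        if fresh:
            new_jobs[company] = fresh

    # A company that already errored last time is not reported again.
    previously_errored = set(previous["errors"])
    new_errors = [c for c in current["errors"] if c not in previously_errored]
    return new_jobs, new_errors


previous = {"jobs": {"Acme": ["Data Engineer"]}, "errors": ["Globex"]}
current = {
    "jobs": {"Acme": ["Data Engineer", "Backend Engineer"]},
    "errors": ["Globex", "Initech"],
}
new_jobs, new_errors = diff_against_last_run(previous, current)
print(new_jobs)    # {'Acme': ['Backend Engineer']}
print(new_errors)  # ['Initech']
```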

Local development and running

Assumes Python 3.12 or later.

Install dependencies

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

Create initial files

mkdir data
cp example_files/initial_run_record.json data/run_record.json

mkdir configs
cp example_files/config.json configs/config.json

Usage

source .venv/bin/activate

# Search all companies in config
python3 job_scrape.py configs/config.json data/run_record.json

# Limit search to company name
python3 job_scrape.py configs/config.json data/run_record.json --limit_company "example company name"

# Include custom scrapers
python3 job_scrape.py configs/config.json data/run_record.json --add_scrapers_file configs/scrapers.py

# Run headless (i.e. without opening Chrome)
python3 job_scrape.py configs/config.json data/run_record.json --headless

# My usual Crunchbase run
.venv/bin/python job_scrape.py configs/crunchbase data/crunchbase_run_record.json --backup_run_record

Latest commit: jo-room, "Update readme with custom scrapers option", e758661, Mar 3, 2025 (55 commits)

Name	Last commit message	Last commit date
.github/workflows	Add optional config	Feb 8, 2025
deploy	Update docs to config.json	Feb 8, 2025
docs	Update docs to config.json	Feb 8, 2025
example_files	Update docs to config.json	Feb 8, 2025
.gitignore	Update readme	Jan 16, 2025
Dockerfile	Fix dockerfile	Jan 17, 2025
README.md	Update readme with custom scrapers option	Mar 3, 2025
__init__.py	Configure schedule via s3 object	Jan 3, 2025
common_scrapers.py	Allow no search terms and filter on country column	Feb 11, 2025
job_scrape.py	Allow json or py config, with optional scrapers	Feb 11, 2025
lambda_configure_schedule.py	Enable multiple hours	Feb 13, 2025
lambda_function.py	Allow json or py config, with optional scrapers	Feb 11, 2025
license.txt	First commit	Oct 17, 2024
models.py	Add optional config	Feb 8, 2025
requirements.txt	To intercept network requests	Oct 28, 2024
rerun_configure_schedules.py	Improve README, organize	Jan 3, 2025

jo-room/job-scrape
