Careers Page Scraper

Used for:

  1. Job alerts for companies that don't have a way to sign up for job alerts. Requires knowing the specific companies you want to follow, and reusing/writing scrapers for their careers pages.
  2. Scraping Crunchbase pages to collect, de-duplicate, and discover companies.
    • Crunchbase has a lot of scrape protections, so this has to be run locally (manually triggered, rather than getting emailed when there's something new). It should also be run behind a VPN service, unless you want to risk your home IP getting blocked.

If you're a non-software-engineer friend who's interested in this and might be down for some coding-lite, talk to me 🙂. (Or just talk to me anyway, cos yay friends!)

Can be run as a local Python script to output new jobs (relative to the last run) that match your search terms, or hosted on AWS to run on a schedule (e.g. daily) and send an email notification for new jobs. Only new scrape errors are printed/emailed; it will not notify if the company also errored in the last run.
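
The "new since last run" behaviour comes from comparing the current scrape against the stored run record. The snippet below is not taken from this repo; it is a minimal sketch of that diffing idea, and the run-record layout and the "jobs"/"errors" keys are invented for illustration.

import json

def diff_against_run_record(run_record_path, scraped_jobs, scrape_errors):
    """Return only jobs and errors not already seen in the previous run.

    Illustrative sketch only -- the real run_record.json format in this
    repo may differ; the "jobs" and "errors" keys are assumptions.
    """
    with open(run_record_path) as f:
        previous = json.load(f)

    new_jobs = [job for job in scraped_jobs if job not in previous.get("jobs", [])]
    # An error is only reported if the company did NOT also error in the last run.
    new_errors = [err for err in scrape_errors if err not in previous.get("errors", [])]

    # Persist this run so the next run diffs against it.
    with open(run_record_path, "w") as f:
        json.dump({"jobs": scraped_jobs, "errors": scrape_errors}, f, indent=2)

    return new_jobs, new_errors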

Local development and running

Assumes Python 3.12 and up.

Install dependencies

python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

Create initial files

mkdir data
cp example_files/initial_run_record.json data/run_record.json

mkdir configs
cp example_files/config.json configs/config.json

Usage

source .venv/bin/activate

# Search all companies in config
python3 job_scrape.py configs/config.json data/run_record.json

# Limit search to company name
python3 job_scrape.py configs/config.json data/run_record.json --limit_company "example company name"

# Include custom scrapers
python3 job_scrape.py configs/config.json data/run_record.json --add_scrapers_file configs/scrapers.py

# Run headless (i.e. without opening Chrome)
python3 job_scrape.py configs/config.json data/run_record.json --headless

# My usual Crunchbase run
.venv/bin/python job_scrape.py configs/crunchbase data/crunchbase_run_record.json --backup_run_record
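
The file passed via --add_scrapers_file must follow whatever interface job_scrape.py expects, which isn't documented here. Purely as a hedged sketch, a per-company careers-page scraper built on Selenium (the Chrome/--headless notes above suggest browser automation) might look roughly like this; the function name, signature, URL, and CSS selector are all assumptions, not this project's actual API:

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_example_company(headless=True):
    """Hypothetical scraper: return a list of job titles from a careers page.

    Only an illustration -- real scrapers in configs/scrapers.py must match
    the interface job_scrape.py actually expects.
    """
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/careers")  # placeholder URL
        postings = driver.find_elements(By.CSS_SELECTOR, ".job-listing .title")
        return [posting.text for posting in postings]
    finally:
        driver.quit()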

Deployment

See docs/deployment.md.

Using the hosted scraper

See docs/usage.md.

Acknowledgements
