jay-law/job-scraper

Introduction

Job boards (like LinkedIn) can be a good source of job openings. Unfortunately, their search results cannot always be filtered to a usable degree. Exfill (short for extraction) lets users scrape and parse jobs with more flexibility than the default search provides.

Currently only LinkedIn is supported.

Project Structure

Directories:

  • src/exfill/parsers - Contains parser(s)
  • src/exfill/scrapers - Contains scraper(s)
  • src/exfill/support
  • data/html
    • Not in source control
    • Contains HTML elements for a specific job posting
    • Populated by a scraper
  • data/csv
    • Not in source control
  • Contains parsed information in a CSV table
    • Populated by a parser
    • Also contains an error table
  • logs
    • Not in source control
    • Contains logs created during execution

creds.json File

Syntax should be as follows:

{
    "linkedin": {
        "username": "[email protected]",
        "password": "password1"
    }
}
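With this layout, reading the credentials from Python is straightforward. The following is an illustrative sketch only; the `load_creds` helper and its arguments are made up for this example and are not part of the project's API:

```python
import json

def load_creds(path="creds.json", site="linkedin"):
    """Return the (username, password) pair for a site from creds.json."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    entry = creds[site]  # KeyError here means the site block is missing
    return entry["username"], entry["password"]
```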

Usage

There are two actions required to generate usable data:

First is the scraping action. When called, a browser opens and performs a job query on the specified site. Each posting is exported to the data/html directory.

The second action is parsing. Each job posting in data/html is opened and analyzed. Once all postings have been analyzed, a single CSV file is exported to data/csv.

The CSV file provides a high-level overview of all the jobs returned by the query. When imported into a spreadsheet, users can filter on fields not present in the original search options, such as sorting by company or excluding certain industries.
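That kind of post-hoc filtering can also be done with a few lines of stdlib Python. Note that the `company` and `industry` column names below are assumptions made for this sketch; check the header row of the exported CSV for the real field names:

```python
import csv

def exclude_industries(csv_path, banned):
    """Return job rows whose 'industry' field is not in the banned set.

    The 'industry' column name is hypothetical; adjust it to match the
    header row of the CSV exported to data/csv.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if row["industry"] not in banned]
```

The same pattern works for any column, e.g. sorting the surviving rows by company with `sorted(rows, key=lambda r: r["company"])`.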

Add Creds File

This is required for all usage.

# Create and populate creds.json.  Bash only:
cat <<EOF > creds.json
{
    "linkedin": {
        "username": "[email protected]",
        "password": "password1"
    }
}
EOF

Execute

# Clone the repository
git clone [email protected]:jay-law/job-scraper.git
cd job-scraper

# Install dependencies
poetry install

# Ensure creds.json exists (see above)

# Execute - Scrape LinkedIn
poetry run script-run -c config.ini -s linkedin scrape

# Execute - Parse LinkedIn
poetry run script-run -c config.ini parse