Job boards (like LinkedIn) can be a good source for finding job openings. Unfortunately, the search results cannot always be filtered to a usable degree. Exfill (short for extraction) lets users scrape and parse jobs with more flexibility than the default search provides.
Currently only LinkedIn is supported.
Directories:
src/exfill/parsers
- Contains parser(s)
src/exfill/scrapers
- Contains scraper(s)
src/exfill/support
- Contains the geckodriver driver for Firefox, which is used by Selenium
- Download the latest driver from the Mozilla GeckoDriver repo on GitHub
data/html
- Not in source control
- Contains HTML elements for a specific job posting
- Populated by a scraper
data/csv
- Not in source control
- Contains parsed information in a csv table
- Populated by a parser
- Also contains an error table
logs
- Not in source control
- Contains logs created during execution
Credentials are read from creds.json. Its syntax should be as follows:
{
    "linkedin": {
        "username": "[email protected]",
        "password": "password1"
    }
}
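As an illustration of how a file in this shape can be consumed, here is a minimal Python sketch. The function name `load_creds` and its error handling are assumptions made for this example and are not taken from the Exfill source.

```python
import json
from pathlib import Path


def load_creds(path, site="linkedin"):
    """Return (username, password) for `site` from a creds.json-style file.

    Hypothetical helper: Exfill's own loading code may look different.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    try:
        entry = data[site]
        return entry["username"], entry["password"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise ValueError(f"creds file is missing key: {missing}") from None
```

Calling `load_creds("creds.json")` on the file above would return the LinkedIn username and password as a tuple.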
There are two actions required to generate usable data:
First is the scraping action. When called, a browser will open and perform a job query on the specified site. Each posting will be exported to the data/html directory.
The second action is parsing. Each job posting in data/html will be opened and analyzed. Once all postings have been analyzed, a single CSV file will be exported to data/csv.
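The parsing step can be pictured with the rough sketch below. It is not Exfill's parser: the names `TitleParser` and `parse_postings`, the `*.html` file pattern, and the two CSV columns are assumptions, and the page title stands in for whatever fields the real parser extracts.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_postings(html_dir, csv_path):
    """Parse every *.html file in html_dir into one CSV; return the row count."""
    rows = []
    for html_file in sorted(Path(html_dir).glob("*.html")):
        parser = TitleParser()
        parser.feed(html_file.read_text(encoding="utf-8"))
        parser.close()
        # One row per posting; the columns here are placeholders.
        rows.append({"file": html_file.name, "title": parser.title.strip()})

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "title"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The shape mirrors the description above: many HTML files in, one CSV file out.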
The CSV file provides a high-level overview of all the jobs returned by the query. When it is imported into a spreadsheet, users can filter on fields not present in the original search options. Examples include sorting by company or excluding certain industries.
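The same kind of filtering can also be done without a spreadsheet. The sketch below uses only Python's standard library. The column names `company` and `industry` are assumptions for illustration, and the real export may use different headers.

```python
import csv


def filter_jobs(csv_path, excluded_industries, sort_key="company"):
    """Return rows whose industry is not excluded, sorted by sort_key.

    Assumes hypothetical "company" and "industry" columns in the CSV.
    """
    excluded = {name.lower() for name in excluded_industries}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [
            row
            for row in csv.DictReader(handle)
            if row["industry"].lower() not in excluded
        ]
    # Case-insensitive sort so "acme" and "Acme" are ordered together.
    return sorted(rows, key=lambda row: row[sort_key].lower())
```

For example, `filter_jobs(path, ["Staffing"])` would drop every staffing-industry row and return the rest ordered by company name.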
Creating creds.json is required for all usage.
# Create and populate creds.json. Bash only:
cat <<EOF > creds.json
{
    "linkedin": {
        "username": "[email protected]",
        "password": "password1"
    }
}
EOF
# Install with git
git clone git@github.com:jay-law/job-scraper.git
# Change into the cloned project directory
cd job-scraper
# Install dependencies
poetry install
# Ensure creds.json exists (see above)
# Execute - Scrape linkedin
poetry run script-run -c config.ini -s linkedin scrape
# Execute - Parse linkedin
poetry run script-run -c config.ini parse