Scrape news feeds from the New Jersey government
Known bugs before v1.0
- GitHub Install
- Encoding
You can install the package using the following steps:
- pip install using an admin prompt
  pip uninstall NJGovNews
  pip install -v git+https://github.com/TextCorpusLabs/NJGovNews.git
You can run the package as follows:
NJGovNews SITE -out FILE_OUT
The scraper currently supports the following SITEs:
- The Department of the Treasury.
For example:
NJGovNews treasury -out "c:/data/news/nj_treasury.csv"
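If you would rather start the scraper from a Python script than from a prompt, the command above can be wrapped with the standard library. This is only a sketch: the helper names build_command and run_scraper are hypothetical and not part of the package, and it assumes the NJGovNews command is already installed and on your PATH.

```python
import subprocess
from typing import List


def build_command(site: str, file_out: str) -> List[str]:
    """Build the argument list for `NJGovNews SITE -out FILE_OUT`."""
    return ["NJGovNews", site, "-out", file_out]


def run_scraper(site: str, file_out: str) -> int:
    """Run the scraper and return its exit code.

    Raises FileNotFoundError if the NJGovNews command is not on PATH.
    """
    completed = subprocess.run(build_command(site, file_out), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Same call as the command-line example above.
    exit_code = run_scraper("treasury", "c:/data/news/nj_treasury.csv")
    print(f"NJGovNews exited with code {exit_code}")
```

Passing the arguments as a list avoids shell quoting problems when the output path contains spaces.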
This scraper uses requests-cache to improve performance.
If you want to force a full reload of all the data, delete the file called 'SITE.cache.sqlite'.
It will be in the same folder as the .csv the scraper created.
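Deleting the cache can also be scripted. The sketch below assumes that "SITE" in the cache file name is replaced by the site argument you passed to the scraper (for example treasury.cache.sqlite). The helper name clear_cache is hypothetical and not part of the package.

```python
from pathlib import Path


def clear_cache(csv_path: str, site: str) -> bool:
    """Delete the cache file that sits in the same folder as the scraper's .csv.

    Returns True if a cache file was found and removed, False otherwise.
    """
    # Assumption: the cache is named '<site>.cache.sqlite'.
    cache_file = Path(csv_path).parent / f"{site}.cache.sqlite"
    if cache_file.is_file():
        cache_file.unlink()
        return True
    return False


if __name__ == "__main__":
    removed = clear_cache("c:/data/news/nj_treasury.csv", "treasury")
    print("cache removed" if removed else "no cache file found")
```

The next run of the scraper will then download everything again and rebuild the cache.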
You can install the package for development using the following steps:
Note: You can replace steps 1-3 using the VSCode Git:Clone command
- Download the project from GitHub
- Click the green "Code" button on the right. Select "Download Zip"
- Remove zip protections by right-clicking on the file, selecting properties, and checking "security: unblock"
- Unzip the folder. I recommend using the folder c:/repos/TextCorpusLabs/NJGovNews
- Run pip's editable install using an admin prompt
  pip uninstall NJGovNews
  pip install -v -e c:/repos/TextCorpusLabs/NJGovNews
- Install the nltk add-ons using an admin prompt
  python -c "import nltk;nltk.download('punkt')"