Dependencies are listed in requirements.txt:
- pandas to generate the Excel output
- configparser to store general environment settings
- pydantic for data validation and typing

Install them with:
```sh
pip install -r requirements.txt
```
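For reference, a requirements.txt matching the list above would contain entries like the following; the repository's actual file may pin versions or include extra packages (e.g. openpyxl, which pandas needs for .xlsx output, or a browser-automation library):

```text
pandas
configparser
pydantic
```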
Alternatively, you can use conda:
```sh
conda create -n python-news-crawler python=3.9
conda activate python-news-crawler
conda env update -f conda.yaml
```
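A minimal conda.yaml compatible with the commands above might look like this sketch; it only mirrors the dependency list, and the repository's actual file may differ:

```yaml
name: python-news-crawler
channels:
  - defaults
dependencies:
  - python=3.9
  - pandas
  - pip
  - pip:
      - pydantic
```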
To start the application, execute the main.py file:
```sh
python .\main.py
```
- This project consists of an RPA (Robotic Process Automation) script that opens the nytimes.com website.
- It collects its settings from config.ini: a search phrase, optional categories, and a number of months (1, 2, 3); a sketch of the file follows this list.
- It searches for the search phrase, filtered by the categories (if provided) and the month range.
- After iterating through all posts, the application creates a folder for the search inside src/.
- The application saves all images in src/{search_phrase}/img.
- The application generates an Excel file with the significant data in src/{search_phrase}/news.xlsx; see the pandas sketch after this list.
- If everything runs well, the log indicates success and the script exits gracefully.
- Some searches can return hundreds of thousands of results.
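A sketch of the config.ini described above. The section and key names are assumptions for illustration; only the three settings themselves (search phrase, categories, months) come from the project description:

```ini
[search]
search_phrase = climate change
categories = Arts, Science
months = 3
```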
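The Excel step can be pictured with plain pandas. The helper below is an illustrative sketch, not the project's actual code, and it assumes openpyxl is installed for .xlsx output:

```python
from pathlib import Path

import pandas as pd


def save_news(posts: list[dict], search_phrase: str) -> None:
    """Write the collected posts to src/{search_phrase}/news.xlsx."""
    out_dir = Path("src") / search_phrase
    out_dir.mkdir(parents=True, exist_ok=True)
    # Each dict becomes one row; keys such as "title" or "date" are
    # assumed column names for illustration.
    pd.DataFrame(posts).to_excel(out_dir / "news.xlsx", index=False)
```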
- A function called load_more() is invoked inside iterate_news() in crawler.py.
- This function repeatedly searches for the "Load more" button at the bottom of the news page.
- This can significantly increase the time the script takes to finish.
- To check only the first 10 news posts, comment out the line where load_more() is called. A rough sketch of what load_more() does follows below.
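As an illustration of the load_more() behavior described above, here is a Selenium-style sketch; whether the project actually uses Selenium, and the exact selector, are assumptions:

```python
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def load_more(driver, timeout: int = 10) -> bool:
    """Click the 'Load more' button at the bottom of the results page.

    Returns True when the button was found and clicked, False when no
    more results can be loaded. The XPath is an illustrative guess.
    """
    try:
        button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable(
                (By.XPATH, "//button[contains(., 'Load more')]")
            )
        )
        button.click()
        return True
    except TimeoutException:
        return False
```

In this picture, iterate_news() would call load_more() in a loop until it returns False; commenting that call out limits the run to the first page of results.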