A Python Package which helps to scrape all news details from any news websites

Ayushman278/news-fetch

 
 


news-fetch

news-fetch is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and older, archived articles. You only need to provide the root URL of a news website to crawl it completely. news-fetch combines the power of multiple state-of-the-art libraries and tools, such as news-please by Felix Hamborg and Newspaper3K by Lucas (欧阳象) Ou-Yang. This package combines features from both Felix's and Lucas's work.

I built this to reduce the number of NaN, '', [] and 'None' values returned when scraping some news websites. It is platform-independent and written in Python 3. Programmers and developers can easily use this package to give their programs access to news data.
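The idea of filling gaps left by one extractor with values from another can be sketched as follows. This is a minimal illustration, not the package's actual code: `is_missing` and `first_valid` are hypothetical helper names, and the set of values treated as missing simply mirrors the NaN, '', [] and 'None' cases mentioned above.

```python
import math


def is_missing(value):
    """Return True for the empty values mentioned above (hypothetical helper)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in ("", "None"):
        return True
    if isinstance(value, (list, tuple, set, dict)) and len(value) == 0:
        return True
    return False


def first_valid(*candidates, default=None):
    """Return the first candidate that is not missing, else the default."""
    for candidate in candidates:
        if not is_missing(candidate):
            return candidate
    return default


# Two extractors returned nothing useful, a third one did.
headline = first_valid("", "None", "Example headline")
print(headline)
```

The order of the candidates decides which extractor is preferred, so the most trusted source would go first.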

Source Link
PyPI: https://pypi.org/project/news-fetch/
Repository: https://santhoshse7en.github.io/news-fetch/
Documentation: https://santhoshse7en.github.io/news-fetch_doc/

Dependencies

  • news-please
  • newspaper3k
  • beautifulsoup4
  • fake_useragent
  • selenium
  • chromedriver-binary
  • pandas

Extracted information

news-fetch extracts the following attributes from news articles. Also, have a look at an example JSON file extracted by news-please.

  • headline
  • name(s) of author(s)
  • publication date
  • publication
  • category
  • source_domain
  • article
  • summary
  • keyword
  • url
  • language
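As a rough picture of how one extracted article might be held in memory, the attributes above could map onto a record like the one below. The class name `ArticleRecord`, the field names, and the types are assumptions made for illustration; the objects returned by the package may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArticleRecord:
    """Hypothetical container for the attributes listed above."""
    headline: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    publication_date: Optional[str] = None
    publication: Optional[str] = None
    category: Optional[str] = None
    source_domain: Optional[str] = None
    article: Optional[str] = None
    summary: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    url: Optional[str] = None
    language: Optional[str] = None

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII author names readable
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


# Fields that were not extracted keep their empty defaults.
record = ArticleRecord(
    headline="Example headline",
    authors=["A. Writer"],
    url="https://example.com/news/1",
    language="en",
)
print(record.to_json())
```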

Dependencies Installation

Use the pip package manager to install the dependencies:

pip install -r requirements.txt

Usage

Download it by clicking the green download button here on GitHub. To extract URLs from a targeted website, call the google_search function; you only need to pass the keyword and the newspaper link as arguments.

>>> from newsfetch.google import google_search
>>> google = google_search('Alcoholics Anonymous', 'https://timesofindia.indiatimes.com/')

[Image: directory of Google search result URLs]
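A keyword search restricted to one newspaper can be expressed with Google's `site:` operator. The sketch below only builds such a search URL with the standard library. Whether `google_search` constructs its query this way is an assumption, and `build_site_query` is a hypothetical helper that is not part of news-fetch.

```python
from urllib.parse import urlencode, urlparse


def build_site_query(keyword, newspaper_url):
    """Build a Google search URL limited to the newspaper's domain (hypothetical)."""
    domain = urlparse(newspaper_url).netloc
    if not domain:
        raise ValueError("newspaper_url must include a scheme, e.g. https://")
    # Quote the keyword so it is searched as a phrase, then restrict by site.
    query = f'"{keyword}" site:{domain}'
    return "https://www.google.com/search?" + urlencode({"q": query})


url = build_site_query("Alcoholics Anonymous", "https://timesofindia.indiatimes.com/")
print(url)
```

The helper rejects a link without a scheme because `urlparse` would otherwise return an empty domain and the search would silently cover the whole web.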

To scrape all the news details, call the newspaper function:

>>> from newsfetch.news import newspaper
>>> news = newspaper('https://www.bbc.co.uk/news/world-48810070')

[Image: directory of news]

>>> news.headline

'g20 summit: trump and xi agree to restart us china trade talks'
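Because some fields can be empty for a given site, it may help to read attributes defensively. The sketch below collects fields with `getattr` and a default value. Only `headline` is shown in this README, so the other attribute names are assumptions, and a `SimpleNamespace` stands in for the real `news` object so the example runs without network access.

```python
from types import SimpleNamespace

# Only "headline" is confirmed above; the other names are assumed for illustration.
FIELDS = ("headline", "summary", "keywords")


def collect_fields(news_obj, fields=FIELDS, default=None):
    """Read each attribute if present, falling back to a default (hypothetical helper)."""
    return {name: getattr(news_obj, name, default) for name in fields}


# Stand-in for the object returned by newspaper(...).
fake_news = SimpleNamespace(
    headline="g20 summit: trump and xi agree to restart us china trade talks"
)
details = collect_fields(fake_news)
print(details)
```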

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
