GitHub - notnews/news-url-classifier: Use human readable portion of the URL to classify the kind of news

News URL Classifier

Use the human-readable strings in a URL to classify the kind of news.

Data from: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OTJMYQ

Hard/soft news labels were hand-coded for ~300 articles.

Reference

See this paper: https://osf.io/krhmq

"Finally, we point curious researchers to the paths of URLs, which often contain human-readable text similar to titles. For example, from the URL https://www.theguardian.com/politics/2023/aug/01/boris-johnson-swimming-pool-newts-oxfordshire, one could use “boris johnson swimming pool newts oxfordshire” as input for an NLP classifier. Although we know of no studies doing so, this seems like an intriguing possibility."
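The idea in the quote can be sketched in a few lines of Python. This is a minimal illustration, not the code in `url_classifier.ipynb`: the function name `url_to_text` and the `last_segment_only` flag are hypothetical, and dropping purely numeric tokens (dates, IDs) is an assumption about what counts as noise.

```python
import re
from urllib.parse import urlparse, unquote


def url_to_text(url: str, last_segment_only: bool = True) -> str:
    """Turn the path of a URL into a space-separated string of words.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Keep only the path, decode percent-escapes, and trim leading/trailing slashes.
    path = unquote(urlparse(url).path).strip("/")
    if last_segment_only:
        # The final path segment usually holds the title-like slug.
        path = path.rsplit("/", 1)[-1]
    # Split on anything that is not a lowercase letter or digit.
    tokens = re.split(r"[^a-z0-9]+", path.lower())
    # Assumption: purely numeric tokens (years, days, article IDs) carry no topic signal.
    return " ".join(t for t in tokens if t and not t.isdigit())


url = (
    "https://www.theguardian.com/politics/2023/aug/01/"
    "boris-johnson-swimming-pool-newts-oxfordshire"
)
print(url_to_text(url))
# boris johnson swimming pool newts oxfordshire
print(url_to_text(url, last_segment_only=False))
# politics aug boris johnson swimming pool newts oxfordshire
```

The resulting string could then be fed to any text classifier. Passing `last_segment_only=False` also keeps section names such as `politics`, which may themselves be informative about the kind of news.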

For regex versions from Bakshy et al., see https://github.com/themains/rdomains/blob/master/R/not_news.R#L9 and https://github.com/notnews/notnews/blob/master/notnews/soft_news_url_cat_us.py

Files in the repository:

README.md
current-output-homepage_with_coded_hard_soft.csv
url_classifier.ipynb
