vraxzeztan/20-Newsgroup-Text-Classification

20 Newsgroup Text Classification

Abstract

This data set consists of 20,000 messages taken from 20 newsgroups.

Sources

Original Owner and Donor
Tom Mitchell,
School of Computer Science,
Carnegie Mellon University

Data Characteristics

One thousand Usenet articles were taken from each of the following 20 newsgroups:
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
Approximately 4% of the articles are crossposted. The articles are typical postings and thus have headers including subject lines, signature files, and quoted portions of other articles.

Data Format

Each newsgroup is stored in a subdirectory, with each article stored as a separate file.
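
The layout described above (one subdirectory per newsgroup, one file per article) could be read with something like the following. This is a minimal sketch, not the repository's loading code: the function name `load_newsgroups` and the choice of `latin-1` decoding are assumptions made for illustration.

```python
from pathlib import Path


def load_newsgroups(root):
    """Return (texts, labels) from a directory with one subdirectory per newsgroup.

    Hypothetical helper: the subdirectory name is used as the class label.
    """
    texts, labels = [], []
    root = Path(root)
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for article in sorted(p for p in group_dir.iterdir() if p.is_file()):
            # Assumption: Usenet posts are not reliably UTF-8, and latin-1
            # can decode any byte sequence without raising an error.
            texts.append(article.read_text(encoding="latin-1"))
            labels.append(group_dir.name)
    return texts, labels
```

Calling `load_newsgroups("path/to/dataset")` (a placeholder path) would return two parallel lists: the raw article text, headers included, and the newsgroup name for each article.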

Link to the Dataset

http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

A sample article from the Dataset

[Image: a sample article from the dataset]

Implementation

Multinomial Naive Bayes was used to classify a given article into one of the 20 newsgroups. Both the Multinomial Naive Bayes implementation from sklearn and my own implementation were used for the classification, and their results were compared. The two implementations gave nearly identical results (F1 scores of 0.83 and 0.84, respectively), which suggests that they behave equivalently.
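
For readers unfamiliar with the technique, a from-scratch Multinomial Naive Bayes classifier can be sketched as below. This is an illustrative sketch and not the code in this repository: the class name `SimpleMultinomialNB`, the whitespace tokenizer, and the Laplace smoothing value `alpha=1.0` are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real pipeline would
    # likely also strip headers, punctuation, and stop words.
    return text.lower().split()


class SimpleMultinomialNB:
    """Hypothetical Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.classes_ = sorted(set(labels))
        self.word_counts_ = {c: Counter() for c in self.classes_}
        for text, label in zip(texts, labels):
            self.word_counts_[label].update(tokenize(text))
        # Vocabulary is every word seen in any class during training.
        self.vocab_ = set().union(*self.word_counts_.values())
        class_counts = Counter(labels)
        n_docs = len(labels)
        self.log_prior_ = {
            c: math.log(class_counts[c] / n_docs) for c in self.classes_
        }
        self.total_words_ = {
            c: sum(self.word_counts_[c].values()) for c in self.classes_
        }
        return self

    def predict_one(self, text):
        vocab_size = len(self.vocab_)
        words = [w for w in tokenize(text) if w in self.vocab_]
        best_class, best_score = None, -math.inf
        for c in self.classes_:
            # log P(c) + sum over words of log P(w | c), with smoothing.
            denom = self.total_words_[c] + self.alpha * vocab_size
            score = self.log_prior_[c]
            for w in words:
                score += math.log((self.word_counts_[c][w] + self.alpha) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together. Words never seen in training are skipped at prediction time, which matches what happens when a fixed vocabulary is built from the training set.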

Accuracy

Sklearn's implementation

F1 score: 0.83

Self implementation

F1 score: 0.84
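
The README does not state how the F1 score was averaged over the 20 classes. As one possibility, the sketch below computes a macro-averaged F1 (the unweighted mean of the per-class F1 scores). The function name `macro_f1` is hypothetical, and macro averaging is an assumption rather than the method confirmed to be behind the numbers above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never
        # appears in either the true or the predicted labels.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

With sklearn available, `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity; other `average` settings such as `"weighted"` or `"micro"` would generally give different numbers.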