Extract data

First, I extracted the data from ferdowsi-data.txt, a 14 GB file. Because of its size, it had to be read line by line. After extracting the data and building a data frame, I saved the results in Data.csv.
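The line-by-line extraction can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the record layout of ferdowsi-data.txt is not described here, so `parse_line`, the tab-separated layout, and the `id`/`text` column names are all assumptions. The sketch also writes rows straight to CSV with the standard library instead of building a data frame first, so that memory use stays constant.

```python
import csv


def parse_line(line):
    # Hypothetical layout: one tab-separated record per line.
    # Replace this with the real parsing rules for ferdowsi-data.txt.
    return line.rstrip("\n").split("\t")


def extract_to_csv(src_path, dst_path, header):
    """Stream src_path one line at a time and write parsed rows to dst_path."""
    rows_written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        for line in src:  # the file object yields one line at a time
            fields = parse_line(line)
            if len(fields) != len(header):
                continue  # skip blank or malformed lines
            writer.writerow(fields)
            rows_written += 1
    return rows_written
```

A call such as `extract_to_csv("ferdowsi-data.txt", "Data.csv", ["id", "text"])` would then produce the CSV without ever holding the whole 14 GB file in memory.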

Preprocess on news dataset

In this project, I first visualize the data in 3 groups to better understand the dataset; this is located in data_visualization. Second, I use the hazm library for Persian text processing; this is in the DataCleaning file.

Clustering on news dataset

In this section, I cluster 1,000 news texts in three ways:

• kmeans + BOW (bag of words)

• kmeans + tf-idf

• kmeans + fasttext

kmeans + BOW and kmeans + tf-idf are located in Clustering. kmeans + fasttext is located in Clustering_fasttext.
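To make the first two combinations in the list above concrete, here is a small dependency-free sketch of bag-of-words vectors, TF-IDF vectors, and k-means (Lloyd's algorithm). It is only an illustration and not the code in Clustering: the function names are hypothetical, the IDF is the plain textbook `log(N / df)` rather than any library's smoothed variant, and the input is assumed to be already tokenized.

```python
import math
import random
from collections import Counter


def bow_vectors(docs):
    """Bag of words: one raw term-count vector per tokenized document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def tfidf_vectors(docs):
    """TF-IDF: term counts reweighted by log(N / document frequency)."""
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]


def kmeans(vectors, k, n_iter=100, seed=0):
    """Lloyd's algorithm with squared Euclidean distance; returns labels."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Usage would look like `_, X = bow_vectors(tokenized_news)` followed by `labels = kmeans(X, k=5)`, or the same with `tfidf_vectors`; `tokenized_news` and `k=5` are placeholders. Plain random initialization like this can settle in a poor local optimum, which is one reason a library implementation with better seeding is usually preferable on 1,000 real documents.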

mohadesehjm/Preprocess_and_Clustering_on_news_datasets
