The goal was to research model-based clustering methods, notably the Distance Dependent Chinese Restaurant Process (ddCRP), and to propose an incremental clustering system capable of maintaining a growing number of topic clusters as news articles arrive from a crawler. Documents were represented as fixed-length numeric vectors using LDA, LSA, and doc2vec. Cluster assignments produced by a proof-of-concept implementation of such a system were evaluated with several metrics, notably purity, F-measure, and V-measure. A modification of V-measure, called NV-measure, was introduced to penalize an excessive or insufficient number of clusters. The best results were achieved with doc2vec and ddCRP.
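To make two of the evaluation metrics named above concrete, here is a minimal sketch of purity and V-measure in plain Python. It is not the thesis code: the function names (`purity`, `v_measure`) and the toy labels are assumptions made for illustration, and V-measure is taken in its standard form as the harmonic mean of homogeneity and completeness. NV-measure is not reproduced because its definition is not given in this README.

```python
from collections import Counter
from math import log


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[truth] += 1
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / len(labels_true)


def _entropy(xs):
    """Empirical entropy H(X) in nats."""
    n = len(xs)
    return -sum(c / n * log(c / n) for c in Counter(xs).values())


def _conditional_entropy(xs, given):
    """Empirical conditional entropy H(X | Given) in nats."""
    n = len(xs)
    joint = Counter(zip(given, xs))
    marginal = Counter(given)
    return -sum(c / n * log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(labels_true, labels_pred):
    """Harmonic mean of homogeneity and completeness (standard V-measure)."""
    h_class = _entropy(labels_true)
    h_cluster = _entropy(labels_pred)
    homogeneity = (
        1.0 if h_class == 0
        else 1.0 - _conditional_entropy(labels_true, labels_pred) / h_class
    )
    completeness = (
        1.0 if h_cluster == 0
        else 1.0 - _conditional_entropy(labels_pred, labels_true) / h_cluster
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical toy data: six articles, three topics, three predicted clusters.
topics = ["sport", "sport", "sport", "politics", "politics", "economy"]
clusters = [0, 0, 1, 1, 1, 2]
print(purity(topics, clusters))
print(v_measure(topics, clusters))
```

Putting every article in its own cluster gives a purity of 1.0, which shows why purity alone cannot judge whether the number of clusters is reasonable. V-measure lowers the score in that case through its completeness term.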
Due to copyright restrictions, the news articles used in the experiments are available only at the university library.
Full thesis text: thesis.pdf
Poster: Vana_Martin_2018.pdf
BibTeX citation:
@MASTERSTHESIS {martinvana2018,
author = "Martin Váňa",
title = "Incremental News Clustering",
school = "University of West Bohemia",
year = "2018",
address = "Pilsen",
month = "may"
}
Requirements
- Python 3.5
- Pip
- Pipenv
$ sudo apt-get install python3 python3-tk python3-pip
$ pip3 install pipenv
$ pipenv install --dev
If it fails for some reason, try:
$ pipenv install --dev --skip-lock
$ export PYTHONPATH='.'
$ pipenv shell
$ pipenv run python <script_name>.py
$ pipenv run pytest tests