This is the source code to go along with the series of blog articles:
- Word Embeddings and Document Vectors: Part 1. Similarity
- Word Embeddings and Document Vectors: Part 2. Classification
The code employs:

- Elasticsearch (localhost:9200) as the repository
    - to save tokens to, and get them as needed.
    - to save word vectors (pre-trained or custom) to, and get them as needed.
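
Below is a minimal sketch of this save/fetch pattern with the elasticsearch Python client. The index names and document fields are illustrative assumptions, not necessarily the schema used by the repo's scripts.

```python
# Minimal sketch (7.x-style client calls): saving and fetching tokens
# and word vectors. Index names and fields below are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Save the tokens of one document
es.index(index="twenty-news-tokens", id="doc-1",
         body={"tokens": ["faith", "reason", "atheism"]})

# Save one word vector (pre-trained or custom)
es.index(index="word-vectors", id="faith",
         body={"word": "faith", "vector": [0.12, -0.05, 0.33]})

# Fetch them back as needed
tokens = es.get(index="twenty-news-tokens", id="doc-1")["_source"]["tokens"]
vector = es.get(index="word-vectors", id="faith")["_source"]["vector"]
```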
See the Pipfile for Python dependencies (e.g. `pipenv install` to set up the environment).
1. Generate tokens for the 20-news corpus & the movie review dataset and save them to Elasticsearch.
    - The 20-news dataset is downloaded as part of the script, but you need to download the movie review dataset separately.
    - The shell scripts & Python code are in the folders text-data/twenty-news & text-data/acl-imdb. A tokenization sketch is given after this list.
2. Generate custom word vectors for the two text corpora in step 1 above and save them to Elasticsearch. The text-data/twenty-news/vectors & text-data/acl-imdb/vectors directories have the scripts. See the gensim sketch after this list.
3. Process pre-trained vectors and save them to Elasticsearch. Look into pre-trained-vectors/ for the code. You need to download the actual published vectors from their sources. We have used Word2Vec, GloVe, and FastText in these articles. A loading sketch follows this list.
4. The script run.sh can be configured to run any combination of the above pipeline steps.
5. The logs contain the F-scores and timing results. Create a "logs" directory (`mkdir logs`) before running the run.sh script.
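
As a reference for step 1, here is a minimal tokenization sketch for the 20-news corpus. scikit-learn downloads the dataset on first use; the regex tokenizer below is an illustrative assumption, not necessarily what the repo's scripts do.

```python
# Step 1 sketch: fetch and tokenize the 20-news corpus.
import re
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes"))

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens of length >= 2
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 2]

docs_tokens = [tokenize(doc) for doc in news.data]
print(len(docs_tokens), "documents tokenized")
```

These per-document token lists are what would be indexed into Elasticsearch as in the earlier sketch.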
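For step 2, one common way to train custom word vectors is gensim's Word2Vec; whether the scripts in the vectors/ directories use these exact parameters is an assumption.

```python
# Step 2 sketch: train custom word vectors on tokenized documents with
# gensim (4.x API). vector_size, window, and min_count are illustrative.
from gensim.models import Word2Vec

# Tiny stand-in corpus; in practice, use the token lists from step 1
docs_tokens = [["faith", "reason", "atheism"],
               ["movie", "review", "plot", "acting"]]

model = Word2Vec(sentences=docs_tokens, vector_size=300,
                 window=5, min_count=1, workers=4)

print(len(model.wv.index_to_key), "words in the learned vocabulary")
# Each learned vector, e.g. model.wv["faith"], could then be saved to
# Elasticsearch as in the earlier sketch.
```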
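For step 3, published GloVe vectors ship as a plain text file with one word and its coefficients per line; the loader below is a minimal sketch, and the file name is only a placeholder for whichever pre-trained file you download. Word2Vec and FastText downloads have their own formats and loaders.

```python
# Step 3 sketch: parse a downloaded GloVe text file into a dict of
# word -> vector. The path is a placeholder; download the file first.
def load_glove(path="glove.6B.300d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

glove = load_glove()
print(len(glove), "pre-trained vectors loaded")
# These would then be indexed into Elasticsearch as in the earlier sketch.
```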