This is the source code to go along with the series of blog articles:
- Word Embeddings and Document Vectors: Part 1. Similarity
- Word Embeddings and Document Vectors: Part 2. Classification
The code employs:

- Elasticsearch (localhost:9200) as the repository
    - to save tokens to, and get them as needed.
    - to save word vectors (pre-trained or custom) to, and get them as needed.
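
Below is a minimal sketch of this save/fetch pattern with the elasticsearch Python client. The index names and document fields are illustrative assumptions, not necessarily the schema used by the repo's scripts.

```python
# Minimal sketch (7.x-style client calls): saving and fetching tokens
# and word vectors. Index names and fields below are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Save the tokens of one document
es.index(index="twenty-news-tokens", id="doc-1",
         body={"tokens": ["faith", "reason", "atheism"]})

# Save one word vector (pre-trained or custom)
es.index(index="word-vectors", id="faith",
         body={"word": "faith", "vector": [0.12, -0.05, 0.33]})

# Fetch them back as needed
tokens = es.get(index="twenty-news-tokens", id="doc-1")["_source"]["tokens"]
vector = es.get(index="word-vectors", id="faith")["_source"]["vector"]
```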
See the Pipfile for Python dependencies (e.g. `pipenv install` to set up the environment).
1. Generate tokens for the 20-news corpus & the movie review dataset and save them to Elasticsearch.
    - The 20-news dataset is downloaded as part of the script, but you need to download the movie review dataset separately.
    - The shell scripts & Python code are in the folders text-data/twenty-news & text-data/acl-imdb. A tokenization sketch is given after this list.
2. Generate custom word vectors for the two text corpora in step 1 above and save them to Elasticsearch. The text-data/twenty-news/vectors & text-data/acl-imdb/vectors directories have the scripts. See the gensim sketch after this list.
3. Process pre-trained vectors and save them to Elasticsearch. Look into pre-trained-vectors/ for the code. You need to download the actual published vectors from their sources. We have used Word2Vec, GloVe, and FastText in these articles. A loading sketch follows this list.
4. The script run.sh can be configured to run any combination of the above pipeline steps.
5. The logs contain the F-scores and timing results. Create a "logs" directory (`mkdir logs`) before running the run.sh script.
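
As a reference for step 1, here is a minimal tokenization sketch for the 20-news corpus. scikit-learn downloads the dataset on first use; the regex tokenizer below is an illustrative assumption, not necessarily what the repo's scripts do.

```python
# Step 1 sketch: fetch and tokenize the 20-news corpus.
import re
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes"))

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens of length >= 2
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 2]

docs_tokens = [tokenize(doc) for doc in news.data]
print(len(docs_tokens), "documents tokenized")
```

These per-document token lists are what would be indexed into Elasticsearch as in the earlier sketch.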
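For step 2, one common way to train custom word vectors is gensim's Word2Vec; whether the scripts in the vectors/ directories use these exact parameters is an assumption.

```python
# Step 2 sketch: train custom word vectors on tokenized documents with
# gensim (4.x API). vector_size, window, and min_count are illustrative.
from gensim.models import Word2Vec

# Tiny stand-in corpus; in practice, use the token lists from step 1
docs_tokens = [["faith", "reason", "atheism"],
               ["movie", "review", "plot", "acting"]]

model = Word2Vec(sentences=docs_tokens, vector_size=300,
                 window=5, min_count=1, workers=4)

print(len(model.wv.index_to_key), "words in the learned vocabulary")
# Each learned vector, e.g. model.wv["faith"], could then be saved to
# Elasticsearch as in the earlier sketch.
```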
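For step 3, published GloVe vectors ship as a plain text file with one word and its coefficients per line; the loader below is a minimal sketch, and the file name is only a placeholder for whichever pre-trained file you download. Word2Vec and FastText downloads have their own formats and loaders.

```python
# Step 3 sketch: parse a downloaded GloVe text file into a dict of
# word -> vector. The path is a placeholder; download the file first.
def load_glove(path="glove.6B.300d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

glove = load_glove()
print(len(glove), "pre-trained vectors loaded")
# These would then be indexed into Elasticsearch as in the earlier sketch.
```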