
Word Embeddings and Document Vectors

This is the source code accompanying the series of blog articles.

The code uses:

  • Elasticsearch (localhost:9200) as the repository

    1. to save tokens to, and fetch them as needed
    2. to save word vectors (pre-trained or custom) to, and fetch them as needed
  • See the Pipfile for Python dependencies
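As an illustration of the Elasticsearch round trip, a word vector can be stored as a plain document keyed by the word and read back by id. The index and field names below are assumptions for this sketch, not necessarily the schema the repo uses:

```python
# Sketch: serializing a word vector as an Elasticsearch bulk action.
# Index name ("word-vectors") and field names are illustrative assumptions.
# A list of such actions could be indexed against localhost:9200 with
# elasticsearch.helpers.bulk(es_client, actions).

def vector_doc(word, vector, index="word-vectors"):
    """Build one bulk-indexing action: the word is the document id,
    the vector is stored as a plain list of floats."""
    return {
        "_index": index,
        "_id": word,
        "_source": {"word": word, "vector": list(vector)},
    }

doc = vector_doc("movie", [0.12, -0.53, 0.07])
print(doc["_id"], doc["_source"]["vector"])
```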

Usage

  1. Generate tokens for the 20-news corpus & the movie review dataset and save them to Elasticsearch.

    • The 20-news dataset is downloaded as part of the script, but you need to download the movie review dataset separately.
    • The shell scripts & Python code are in the folders text-data/twenty-news & text-data/acl-imdb.
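Tokenization in step 1 might look something like the sketch below. The specific rules (lowercasing, a stopword list, dropping one-character tokens) are illustrative assumptions, not the repo's exact tokenizer:

```python
import re

# Tiny illustrative stopword list; the repo's actual list may differ.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "was"}

def tokenize(text):
    """Lowercase the text, split on runs of non-letters, and drop
    stopwords and one-character fragments. (Illustrative only.)"""
    words = re.split(r"[^a-z]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]

print(tokenize("The movie was GOOD, and fun!"))  # → ['movie', 'good', 'fun']
```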
  2. Generate custom word vectors for the two text corpora in step 1 above and save them to Elasticsearch. The text-data/twenty-news/vectors & text-data/acl-imdb/vectors directories have the scripts.

  3. Process pre-trained vectors and save them to Elasticsearch. Look in pre-trained-vectors/ for the code. You need to download the published vectors from their sources; these articles use Word2Vec, GloVe, and FastText.

  4. The script run.sh can be configured to run any combination of the pipeline steps.

  5. The logs contain the F-scores and timing results. Create a logs directory before running the run.sh script:

    mkdir logs
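Once tokens and word vectors are in place, the pipeline combines them into document vectors for classification (the F-scores in the logs come from classifiers run on those vectors). A minimal sketch of one such combination, plain averaging of a document's word vectors; the repo may weight tokens differently, e.g. with tf-idf:

```python
import numpy as np

def doc_vector(tokens, word_vectors, dim=3):
    """Average the word vectors of a document's tokens.
    Tokens without a stored vector (out-of-vocabulary) are skipped;
    a document with no known tokens maps to the zero vector."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy 3-dimensional "embedding" for illustration only.
wv = {"good": np.array([1.0, 0.0, 1.0]),
      "movie": np.array([0.0, 1.0, 1.0])}

dv = doc_vector(["good", "movie", "unknownword"], wv)
print(dv)  # → [0.5 0.5 1. ]
```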