Skip to content

Homework 3: Clustering

Jinho D. Choi edited this page Oct 31, 2016 · 1 revision

Word Clustering

Your task is to apply minimum spanning tree algorithms to cluster similar words together in vector space.

Preparation

  • Install Graphviz.
  • Download the word vectors.
  • Download MSTWord.java, rename it to MSTLastname (e.g., MSTChoi), and put it under the graph.span package.
  • Modify INPUT_FILE and OUTPUT_FILE in MSTWord.java and run the program.
  • Open the output of the program in Graphviz.

Input Format

Each line in the input file consists of a word and its vector representation. For instance, the first line consists of the word "New" (0th column) and its vector representation (1st - 50th columns).

Task 1

For each pair of words, measure the Euclidean distance between their vectors. You may want to save the distances in a two-dimensional array such as:

float[][] distance = new float[500][500];

Task 2

Currently, MSTWord finds a random graph. Use Prim's algorithm to find the minimum spanning tree with respect to the Euclidean distances from Task 1.

Task 3

For each pair of words, measure the cosine distance instead of the Euclidean distance. The cosine distance is (1 - Cosine Similarity). Do Task 2 using the cosine distance.

Extra Credit

Try out different starting vertices for Prim's algorithm and see if they produce different minimum spanning trees.

Submission

  • Submit MSTLastname.java, and word_vector.dot.
  • Submit a report including your findings (e.g., weights of the minimum spanning trees, difference between the Euclidean and cosine distances).
  • If you are doing the extra credit, submit multiple word_vector.dot files for all spanning trees.

CS323: Data Structures and Algorithms

Instructor


Emory University

Clone this wiki locally