Homework 3: Clustering

Word Clustering

Your task is to apply minimum spanning tree algorithms to cluster similar words together in vector space.

Preparation

Install Graphviz.
Download the word vectors.
Download MSTWord.java, rename it to MSTLastname (e.g., MSTChoi), and put it under the graph.span package.
Modify INPUT_FILE and OUTPUT_FILE in MSTWord.java and run the program.
Open the output of the program in Graphviz.

Input Format

Each line in the input file consists of a word and its vector representation. For instance, the first line consists of the word "New" (0th column) and its vector representation (1st - 50th columns).

Task 1

For each pair of words, measure the Euclidean distance between their vectors. You may want to save the distances in a two-dimensional array such as:

float[][] distance = new float[500][500];

Task 2

Currently, MSTWord finds a random graph. Use Prim's algorithm to find the minimum spanning tree with respect to the Euclidean distances from Task 1.

Task 3

For each pair of words, measure the cosine distance instead of the Euclidean distance. The cosine distance is (1 - Cosine Similarity). Do Task 2 using the cosine distance.

Extra Credit

Try out different starting vertices for Prim's algorithm and see if they produce different minimum spanning trees.

Submission

Submit MSTLastname.java, and word_vector.dot.
Submit a report including your findings (e.g., weights of the minimum spanning trees, difference between the Euclidean and cosine distances).
If you are doing the extra credit, submit multiple word_vector.dot files for all spanning trees.

CS323: Data Structures and Algorithms

Syllabus.
Schedule.

Instructor

Jinho D. Choi

Emory University

Provide feedback

Saved searches

Use saved searches to filter your results more quickly