Homework 3: Clustering
Your task is to apply minimum spanning tree algorithms to cluster similar words together in vector space.
- Install Graphviz.
- Download the word vectors.
- Download MSTWord.java, rename it to MSTLastname (e.g., MSTChoi), and put it under the graph.span package.
- Modify INPUT_FILE and OUTPUT_FILE in MSTWord.java and run the program.
- Open the output of the program in Graphviz.
Task 1
Each line in the input file consists of a word followed by its vector representation. For instance, the first line consists of the word "New" (0th column) and its vector representation (1st through 50th columns).
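To make the line layout concrete, here is a minimal sketch of splitting one input line into its word and vector. It assumes the columns are separated by whitespace, which the description above does not state, and the class name WordVectorLine is hypothetical rather than part of the provided code. MSTWord.java may already read the file for you, in which case this is only an illustration of the format.

```java
/** Hypothetical holder for one input line; not part of MSTWord.java. */
final class WordVectorLine {
    final String word;     // 0th column
    final float[] vector;  // remaining columns

    WordVectorLine(String word, float[] vector) {
        this.word = word;
        this.vector = vector;
    }

    /** Parses "word v1 v2 ... vN", assuming whitespace-separated columns. */
    static WordVectorLine parse(String line) {
        String[] cols = line.trim().split("\\s+");
        float[] vector = new float[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            vector[i - 1] = Float.parseFloat(cols[i]);
        }
        return new WordVectorLine(cols[0], vector);
    }
}
```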
For each pair of words, measure the Euclidean distance between their vectors. You may want to save the distances in a two-dimensional array such as:
float[][] distance = new float[500][500];
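One possible way to fill such an array is sketched below. It assumes the vectors have already been loaded into a float[][] with one row per word. The class and method names are hypothetical. Because the Euclidean distance is symmetric, each pair is computed once and stored in both positions.

```java
/** Hypothetical helper for pairwise Euclidean distances; not part of MSTWord.java. */
final class EuclideanDistances {
    /** Returns the Euclidean distance between two vectors of equal length. */
    static float distance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return (float) Math.sqrt(sum);
    }

    /** Builds the symmetric n-by-n matrix of pairwise distances (diagonal stays 0). */
    static float[][] matrix(float[][] vectors) {
        int n = vectors.length;
        float[][] distance = new float[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                float d = distance(vectors[i], vectors[j]);
                distance[i][j] = d;
                distance[j][i] = d;
            }
        }
        return distance;
    }
}
```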
Task 2
Currently, MSTWord finds a random graph. Use Prim's algorithm to find the minimum spanning tree with respect to the Euclidean distances from Task 1.
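For reference, a minimal sketch of Prim's algorithm on a dense distance matrix follows. It uses plain arrays rather than the graph classes of the course package, so the class name PrimSketch and its methods are hypothetical, and you would still need to adapt the result to whatever MSTWord expects. Since every pair of words has a distance, the graph is complete, and the simple O(n^2) array version is a reasonable fit.

```java
import java.util.Arrays;

/** Hypothetical array-based sketch of Prim's algorithm; not part of MSTWord.java. */
final class PrimSketch {
    /**
     * Returns parent[v], the vertex that v is attached to in the spanning tree.
     * parent[start] is -1. Assumes a symmetric matrix over a connected graph.
     */
    static int[] prim(float[][] distance, int start) {
        int n = distance.length;
        boolean[] inTree = new boolean[n];
        float[] best = new float[n];   // cheapest known edge from the tree to each vertex
        int[] parent = new int[n];
        Arrays.fill(best, Float.POSITIVE_INFINITY);
        Arrays.fill(parent, -1);
        best[start] = 0f;

        for (int k = 0; k < n; k++) {
            // Pick the vertex outside the tree with the cheapest connecting edge.
            int u = -1;
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            }
            inTree[u] = true;
            // Relax the edges from u to the vertices still outside the tree.
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && distance[u][v] < best[v]) {
                    best[v] = distance[u][v];
                    parent[v] = u;
                }
            }
        }
        return parent;
    }

    /** Sums the weights of the tree edges described by the parent array. */
    static double weight(float[][] distance, int[] parent) {
        double total = 0;
        for (int v = 0; v < parent.length; v++) {
            if (parent[v] >= 0) total += distance[v][parent[v]];
        }
        return total;
    }
}
```

Calling PrimSketch.prim(distance, 0) and then PrimSketch.weight(distance, parent) gives the total tree weight, which is one of the figures the report asks for.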
Task 3
For each pair of words, measure the cosine distance instead of the Euclidean distance. The cosine distance is 1 minus the cosine similarity. Repeat Task 2 using the cosine distance.
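The cosine similarity of two vectors is their dot product divided by the product of their lengths, so the cosine distance can be sketched as below. The class name is hypothetical, and the sketch assumes neither vector is all zeros (otherwise the division yields NaN).

```java
/** Hypothetical helper for the cosine distance; not part of MSTWord.java. */
final class CosineDistance {
    /** Returns 1 - cos(a, b) for two non-zero vectors of equal length. */
    static float distance(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double similarity = dot / (Math.sqrt(normA) * Math.sqrt(normB));
        return (float) (1.0 - similarity);
    }
}
```

The result ranges from 0 (same direction) to 2 (opposite directions), and unlike the Euclidean distance it ignores the lengths of the vectors.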
Extra Credit
Try different starting vertices for Prim's algorithm and see whether they produce different minimum spanning trees.
- Submit MSTLastname.java and word_vector.dot.
- Submit a report including your findings (e.g., weights of the minimum spanning trees, differences between the Euclidean and cosine distances).
- If you are doing the extra credit, submit multiple word_vector.dot files, one for each spanning tree.
Copyright © 2014-2017 Emory University - All Rights Reserved.