Benchmarking on larger datasets (low accuracy) #2
I've been doing some more work with this repo, and I think it'd be productive to do some benchmarking beyond the given example. For example, I've started working with the text8 corpus (the first 10^8 bytes of Wikipedia) and running some tests; Gensim's implementation gives an accuracy of ~24% on analogies, while I'm only seeing ~5% with this model.
Ideally we wouldn't see this kind of gap, so maybe it'd be a good idea to do some testing on larger datasets? I can also share some code to this end.
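For reference, the Gensim side of that comparison can be reproduced with something like the sketch below. The local `"text8"` path and the hyperparameters are assumptions on my part, not the exact settings used for the ~24% number:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

# Stream sentences from a local copy of text8 (path is an assumption;
# the corpus is available at http://mattmahoney.net/dc/text8.zip).
sentences = Text8Corpus("text8")

# Skip-gram with negative sampling; hyperparameters are illustrative.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1,
                 negative=5, min_count=5, workers=4, epochs=5)

# Score on the Google analogy set that ships with Gensim's test data.
score, sections = model.wv.evaluate_word_analogies(
    datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.1%}")
```

`evaluate_word_analogies` returns the overall accuracy plus per-section results, so the same call could score this repo's vectors once they're exported in word2vec text format and loaded with `KeyedVectors.load_word2vec_format`.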
Comments
Great catch. And yes, this repo is very poorly benchmarked, so this type of work is very much appreciated. If you have a repo demonstrating this difference, I'd love to take a look! I should have some free time in the coming weeks to make some improvements.
Absolutely, really appreciate your responsiveness. I'm a bit busy this week with a deadline (related to this work) but will try to share a repo soon after. A couple of suggestions: adding subsampling of frequent words, and having two weight matrices (a separate one for context and center lookups).
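To make the subsampling suggestion concrete, here is a minimal sketch (the function name and threshold `t` are placeholders; it follows the keep-probability sqrt(t / f(w)) from the word2vec paper):

```python
import math
import random
from collections import Counter

def subsample_frequent(tokens, t=1e-5, rng=random):
    """Drop frequent tokens following Mikolov et al. (2013):
    each occurrence of word w survives with probability
    min(1, sqrt(t / f(w))), where f(w) is w's relative frequency."""
    total = len(tokens)
    freq = {w: c / total for w, c in Counter(tokens).items()}
    return [w for w in tokens
            if rng.random() < min(1.0, math.sqrt(t / freq[w]))]
```

On text8-scale corpora this typically discards a large share of stopword occurrences, which both speeds up training and tends to improve analogy scores.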
No problem, happy to work towards making this repo into a good resource for people. As for subsampling, it is done in […]. I've seen the two-weight matrices done before and I'm good to give it a try. Looking forward to making some improvements.
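On the two-matrix point, here is a rough numpy sketch of what separate center/context lookups could look like (class and method names are mine, not from this repo):

```python
import numpy as np

class SkipGramSGNS:
    """Skip-gram negative sampling with separate center ("input")
    and context ("output") embedding matrices, as in word2vec."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5 / dim, 0.5 / dim, (vocab_size, dim))
        self.W_out = np.zeros((vocab_size, dim))  # word2vec.c zero-inits this

    def step(self, center, context, negatives, lr=0.025):
        """One SGD update for a (center, context) pair plus negatives."""
        v = self.W_in[center]
        grad_v = np.zeros_like(v)
        for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = self.W_out[idx].copy()
            p = 1.0 / (1.0 + np.exp(-v @ u))   # sigmoid(v . u)
            g = lr * (label - p)
            grad_v += g * u                     # accumulate center grad
            self.W_out[idx] += g * v            # update context vector
        self.W_in[center] += grad_v
```

At the end of training you'd typically keep `W_in` (or average the two matrices) as the word vectors.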
I may be mistaken, but I believe the code in […]