Benchmarking on larger datasets (low accuracy) #2
I've been doing some more work with this repo, and I think it'd be productive to do some benchmarking beyond the given example. For example, I've started working with the text8 corpus (the first 10^8 bytes of Wikipedia) and running some tests; Gensim's implementation gives an accuracy of ~24% on analogies, while I'm only seeing ~5% with this model.
Ideally we wouldn't see this kind of gap, so maybe it'd be a good idea to do some testing on larger datasets? I can also share some code to this end.
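For reference, the Gensim side of that comparison can be reproduced with something like the sketch below. The local `"text8"` path and the hyperparameters are assumptions on my part, not the exact settings used for the ~24% number:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

# Stream sentences from a local copy of text8 (path is an assumption;
# the corpus is available at http://mattmahoney.net/dc/text8.zip).
sentences = Text8Corpus("text8")

# Skip-gram with negative sampling; hyperparameters are illustrative.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1,
                 negative=5, min_count=5, workers=4, epochs=5)

# Score on the Google analogy set that ships with Gensim's test data.
score, sections = model.wv.evaluate_word_analogies(
    datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.1%}")
```

`evaluate_word_analogies` returns the overall accuracy plus per-section results, so the same call could score this repo's vectors once they're exported in word2vec text format and loaded with `KeyedVectors.load_word2vec_format`.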
Comments
Great catch. And yes, this repo is very poorly benchmarked, so this type of work is very much appreciated. If you have a repo demonstrating this difference, I'd love to take a look! I should have some free time in the coming weeks to make some improvements.
Absolutely, really appreciate your responsiveness. I'm a bit busy this week with a deadline (related to this work) but will try to share a repo soon after. A couple of suggestions: adding subsampling of frequent words, and having two weight matrices (a separate one for context and center lookups).
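To make the subsampling suggestion concrete, here is a minimal sketch (the function name and threshold `t` are placeholders; it follows the keep-probability sqrt(t / f(w)) from the word2vec paper):

```python
import math
import random
from collections import Counter

def subsample_frequent(tokens, t=1e-5, rng=random):
    """Drop frequent tokens following Mikolov et al. (2013):
    each occurrence of word w survives with probability
    min(1, sqrt(t / f(w))), where f(w) is w's relative frequency."""
    total = len(tokens)
    freq = {w: c / total for w, c in Counter(tokens).items()}
    return [w for w in tokens
            if rng.random() < min(1.0, math.sqrt(t / freq[w]))]
```

On text8-scale corpora this typically discards a large share of stopword occurrences, which both speeds up training and tends to improve analogy scores.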
No problem, happy to work towards making this repo into a good resource for people. As for subsampling, it is done in […]. I've seen the two-weight matrices done before and I'm good to give it a try. Looking forward to making some improvements.
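On the two-matrix point, here is a rough numpy sketch of what separate center/context lookups could look like (class and method names are mine, not from this repo):

```python
import numpy as np

class SkipGramSGNS:
    """Skip-gram negative sampling with separate center ("input")
    and context ("output") embedding matrices, as in word2vec."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5 / dim, 0.5 / dim, (vocab_size, dim))
        self.W_out = np.zeros((vocab_size, dim))  # word2vec.c zero-inits this

    def step(self, center, context, negatives, lr=0.025):
        """One SGD update for a (center, context) pair plus negatives."""
        v = self.W_in[center]
        grad_v = np.zeros_like(v)
        for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            u = self.W_out[idx].copy()
            p = 1.0 / (1.0 + np.exp(-v @ u))   # sigmoid(v . u)
            g = lr * (label - p)
            grad_v += g * u                     # accumulate center grad
            self.W_out[idx] += g * v            # update context vector
        self.W_in[center] += grad_v
```

At the end of training you'd typically keep `W_in` (or average the two matrices) as the word vectors.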
I may be mistaken, but I believe the code in […]