Add optimised max_element implementation as a specific case of nth_element with beam = 1 #981
base: master
Conversation
Add optimised max_element implementation as a specific case of nth_element with beam size of 1
Apologies for the late reply @hieuhoang, I have some benchmarks, finally. tl;dr: the beam-1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.
To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz
I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.
I'd like to do some more tests first.
I may do so when I have time. But your results are your own; I review PRs to protect the results on my models.
Extra findings: I tested on a proper student model, and the conclusion is that my changes don't help when using LSH, but they do consistently offer better performance in all other cases. How different is the LSH output layer from a shortlisted output layer?
I am getting some misaligned memory accesses from the LSH scores, and they take more cycles to load; could that be one of the issues?
LSH needs to be used with an output layer without bias; the shortlist doesn't have that restriction. Not sure where the misaligned memory is coming from; if the LSH params are set to alignment-friendly values, e.g. --output-approx-knn 128 1024, you should have aligned memory. If max_element increases speed by 5% without LSH/shortlist, then I'm not surprised that there's no noticeable effect when using LSH/shortlist, since the vocab size over which you need to find the max is much smaller.
I see improvements with a word-alignment-based shortlist, but not with the LSH-based shortlist, where I am consistently slower. I also get misaligned addresses only when using the LSH-based shortlist. How big are your output layers typically? I used a 50 50 shortlist for my previous test, and I can't get an improvement with 100 1024 LSH. What settings do you use? As for alignment, I'd expect the array to be 256-bit aligned at the start; I don't care about the end, as I don't attempt to vectorise the overhang.
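For concreteness, the pattern under discussion (vectorise the aligned body of the scores array, leave the tail to scalar code) looks roughly like this. This is a minimal AVX sketch, not the PR's actual code; the function name, the 32-byte (256-bit) alignment check, and the plain max-value reduction are all illustrative assumptions:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Illustrative only: maximum of n floats, vectorising the 32-byte-aligned
// body and leaving the tail ("overhang") to scalar code.
float maxElementSketch(const float* scores, std::size_t n) {
    std::size_t i = 0;
    float best = scores[0];
    // Take the vector path only if the start is 32-byte (256-bit) aligned,
    // as the discussion above assumes.
    if (reinterpret_cast<std::uintptr_t>(scores) % 32 == 0 && n >= 8) {
        __m256 vbest = _mm256_load_ps(scores);
        for (i = 8; i + 8 <= n; i += 8)
            vbest = _mm256_max_ps(vbest, _mm256_load_ps(scores + i));
        // Reduce the eight lanes to a single scalar maximum.
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, vbest);
        for (float v : lanes)
            if (v > best) best = v;
    }
    // Scalar overhang: the unvectorised tail (or the whole array if the
    // alignment precondition fails).
    for (; i < n; ++i)
        if (scores[i] > best) best = scores[i];
    return best;
}
```

The real fast path also has to track the argmax index for beam search, not just the maximum value, which makes the in-register reduction somewhat more involved; this sketch only shows the alignment and overhang structure being debated above.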
Description
Add optimised max_element implementation as a specific case of nth_element with n = 1
Depending on the compiler used, this should speed up the max-element selection step of beam search by a factor of 2 to 10. A synthetic benchmark can be found here: https://github.com/XapaJIaMnu/maxelem_test
A summary (the benchmark result tables are not reproduced here):
- Cascade Lake results
- Ryzen 9 5900HS results
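Conceptually, the change adds a fast path to the top-k (beam) selection: for a beam of 1, a single linear argmax pass replaces the general nth_element machinery. Below is a minimal sketch of that dispatch, with illustrative names and interfaces rather than the PR's actual ones (it assumes k >= 1 and a non-empty, sufficiently large scores vector):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative dispatch: select the top k (score, index) pairs, with a
// dedicated fast path for k == 1 that is just a single argmax scan.
std::vector<std::pair<float, std::size_t>>
topK(const std::vector<float>& scores, std::size_t k) {
    if (k == 1) {
        // Fast path: one linear pass, no index vector, no swaps.
        std::size_t best = 0;
        for (std::size_t i = 1; i < scores.size(); ++i)
            if (scores[i] > scores[best]) best = i;
        return {{scores[best], best}};
    }
    // General path: attach indices, then partially select the k largest.
    std::vector<std::pair<float, std::size_t>> indexed(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i)
        indexed[i] = {scores[i], i};
    std::nth_element(indexed.begin(), indexed.begin() + (k - 1), indexed.end(),
                     [](const auto& a, const auto& b) { return a.first > b.first; });
    indexed.resize(k);
    return indexed;
}
```

The point of the specialisation is that the k == 1 branch is a pure streaming reduction with no index vector and no element swaps, which both compilers and hand-written SIMD handle far better than general selection.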
Added dependencies: none
How to test
Just load any model with the new code path and test it with a beam size of 1. In our testing this reduced overall runtime by about 1%.
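For example, with a typical marian-decoder invocation (the config and file paths below are placeholders):

```bash
# Decode with beam size 1 so the new max_element fast path is exercised.
./build/marian-decoder -c model/config.yml --beam-size 1 < input.src > output.trg
```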
I didn't run all regression tests because there's something broken in them right now.
Checklist