[ML] Investigate optimizing CTokenListCategory::updateOrderedCommonTokenIds #2403
edsavage added a commit to elastic/elasticsearch that referenced this issue on Sep 8, 2022:
Categorization of strings which break down into a huge number of tokens can cause the C++ backend process to choke - see elastic/ml-cpp#2403. This PR adds a limit filter to the default categorization analyzer which caps the number of tokens passed to the backend at 100. Unfortunately this isn't a panacea for all the issues surrounding categorization of many-token / large messages, as verification checks on the frontend can also fail because calls to the datafeed _preview API return an excessive amount of data.
The method CTokenListCategory::updateOrderedCommonTokenIds is known to be potentially inefficient, and contains a comment acknowledging this. Usually we get away with it, but recently we have seen some data sets where the inefficient algorithm causes problems, namely when there are many tokens in each message.

When m_CommonUniqueTokenIds contains more than a certain number of values, we should take the time at the beginning of CTokenListCategory::updateOrderedCommonTokenIds to build a more efficient data structure to work with. For small numbers of tokens the time taken to build a new data structure (in particular the memory allocations) is likely to outweigh the cost of the looping we currently do, so some experimentation will be needed to find the point at which it becomes worthwhile. For large numbers of tokens, however, we should see a performance improvement.
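
As a rough illustration of the trade-off described above, here is a minimal sketch. It does not use the real CTokenListCategory interface; the helper function, the pair-vector layout and the BUILD_SET_THRESHOLD constant are all hypothetical stand-ins. Below the threshold it keeps a linear scan in the spirit of the current code; above it, it pays the one-off cost of building a std::unordered_set so that each membership test becomes O(1):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

namespace {
// Hypothetical crossover point where building a set becomes cheaper than
// repeated scanning. The value here is a guess; the issue suggests finding
// the real value by experimentation.
const std::size_t BUILD_SET_THRESHOLD{50};
}

using TSizeSizePr = std::pair<std::size_t, std::size_t>;
using TSizeSizePrVec = std::vector<TSizeSizePr>;

// Count how many of a message's tokens are "common" tokens, choosing the
// lookup strategy based on how many common unique token IDs there are.
std::size_t countCommonTokens(const std::vector<std::size_t>& messageTokenIds,
                              const TSizeSizePrVec& commonUniqueTokenIds) {
    std::size_t count{0};
    if (commonUniqueTokenIds.size() < BUILD_SET_THRESHOLD) {
        // Small case: linear scan per token, no extra allocations.
        for (std::size_t tokenId : messageTokenIds) {
            count += std::any_of(commonUniqueTokenIds.begin(),
                                 commonUniqueTokenIds.end(),
                                 [tokenId](const TSizeSizePr& entry) {
                                     return entry.first == tokenId;
                                 });
        }
    } else {
        // Large case: one-off O(N) build of a hash set, then O(1) lookups.
        std::unordered_set<std::size_t> commonIds;
        commonIds.reserve(commonUniqueTokenIds.size());
        for (const TSizeSizePr& entry : commonUniqueTokenIds) {
            commonIds.insert(entry.first);
        }
        for (std::size_t tokenId : messageTokenIds) {
            count += commonIds.count(tokenId);
        }
    }
    return count;
}
```

The threshold is exactly the tuning point the issue calls out: below it the hash-set allocations cost more than the scanning they avoid, while above it the constant-time lookups should win.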