[ML] Investigate optimizing CTokenListCategory::updateOrderedCommonTokenIds #2403

Open
droberts195 opened this issue Sep 7, 2022 · 0 comments
The method CTokenListCategory::updateOrderedCommonTokenIds is known to be potentially inefficient. It contains this comment:

    // The algorithm used here is technically O(N^3 * log(N)), but has these
    // redeeming features:
    // 1. It does not allocate any memory.
    // 2. It is order O(N * log(N)) in the (likely) case of no change required.
    // 3. In the case where the nested loops run many times it will be reducing
    //    the value of N for subsequent calls, thus making those subsequent
    //    calls much faster.  For example, suppose we start off with 100 ordered
    //    tokens, and one call to this method runs the outer loop 50 times to
    //    reduce the ordered set to 50 tokens.  Then the next call will start
    //    with only 50 ordered tokens.
    // There's a trade-off here, because reducing the complexity would mean
    // changing the data structures, which would add complexity to state
    // persistence, or allocating temporary storage, which would increase the
    // runtime a lot in the common case where the previously ordered common
    // tokens are present in the same order in the new tokens.
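To make the trade-off concrete, here is a much-simplified, hypothetical sketch (all names invented, not the actual ml-cpp code) of the allocation-free, nested-scan pattern the comment describes: each ordered token id is located in the new token sequence by a linear scan, and survivors are compacted in place.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Keep only those ordered token ids that still appear, in order, in the
// new token sequence. Works in place, so no memory is allocated, but the
// linear scan nested inside the outer loop is what drives up the worst
// case cost when there are many tokens per message.
void updateOrderedTokens(std::vector<std::size_t>& orderedTokenIds,
                         const std::vector<std::size_t>& newTokenIds) {
    std::size_t writePos = 0;
    std::size_t searchFrom = 0;
    for (std::size_t id : orderedTokenIds) {
        auto it = std::find(newTokenIds.begin() + searchFrom,
                            newTokenIds.end(), id);
        if (it != newTokenIds.end()) {
            searchFrom = static_cast<std::size_t>(it - newTokenIds.begin()) + 1;
            orderedTokenIds[writePos++] = id;
        }
    }
    orderedTokenIds.resize(writePos);
}
```

In the common case where the ordered tokens reappear in the same order, each `std::find` succeeds almost immediately, which is why the method is cheap when no change is required.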

Usually we get away with this, but recently we have seen some data sets where the inefficient algorithm causes problems, namely when each message contains many tokens:

[Screenshot 2022-09-07 at 13:01:25]

When m_CommonUniqueTokenIds contains more than a certain number of values, we should take the time at the beginning of CTokenListCategory::updateOrderedCommonTokenIds to build a more efficient data structure to work with. For small numbers of tokens the cost of building a new data structure (in particular the memory allocations) is likely to outweigh the cost of the looping we currently do, so some experimentation will be needed to find the point at which the switch becomes worthwhile. For large numbers of tokens, however, we should see a performance improvement.
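One way to picture the proposed change is a size-gated switch between the current linear scans and a temporary hash index. The sketch below is illustrative only (names and the threshold value are invented, and the real tuning point would need to be found by experiment, as noted above):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical threshold: below it, linear scans avoid any allocation;
// above it, one O(N) hash-set build pays for itself via O(1) lookups.
// The right value would need to be determined experimentally.
constexpr std::size_t INDEX_THRESHOLD = 64;

std::size_t countCommonTokens(const std::vector<std::size_t>& commonUniqueTokenIds,
                              const std::vector<std::size_t>& newTokenIds) {
    if (commonUniqueTokenIds.size() < INDEX_THRESHOLD) {
        // Small case: scan the vector directly, no allocation.
        std::size_t count = 0;
        for (std::size_t id : newTokenIds) {
            if (std::find(commonUniqueTokenIds.begin(),
                          commonUniqueTokenIds.end(),
                          id) != commonUniqueTokenIds.end()) {
                ++count;
            }
        }
        return count;
    }
    // Large case: build a temporary index once, then do O(1) membership tests.
    std::unordered_set<std::size_t> index(commonUniqueTokenIds.begin(),
                                          commonUniqueTokenIds.end());
    std::size_t count = 0;
    for (std::size_t id : newTokenIds) {
        count += index.count(id);
    }
    return count;
}
```

Both branches compute the same answer; only the asymptotic cost of the membership test differs, which is exactly the trade-off the issue asks to investigate.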

droberts195 added the :ml label Sep 7, 2022
edsavage added a commit to elastic/elasticsearch that referenced this issue Sep 8, 2022
Categorization of strings which break down to a huge number of tokens can cause the C++ backend process to choke - see elastic/ml-cpp#2403.

This PR adds a limit filter to the default categorization analyzer which caps the number of tokens passed to the backend at 100.

Unfortunately this isn't a panacea for all the issues surrounding categorization of many-tokened / large messages, as verification checks on the frontend can also fail due to calls to the datafeed _preview API returning an excessive amount of data.
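For context, Elasticsearch's `limit` token filter caps the number of tokens an analyzer emits via its `max_token_count` parameter. A categorization analyzer override along these lines (illustrative only, not the PR's exact diff) would cap the tokens reaching the backend:

```json
{
  "categorization_analyzer": {
    "tokenizer": "ml_standard",
    "filter": [
      { "type": "limit", "max_token_count": 100 }
    ]
  }
}
```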