[ML] Investigate optimizing CTokenListCategory::updateOrderedCommonTokenIds #2403

Open
droberts195 opened this issue Sep 7, 2022 · 0 comments
The method CTokenListCategory::updateOrderedCommonTokenIds is known to be potentially inefficient. It contains this comment:

    // The algorithm used here is technically O(N^3 * log(N)), but has these
    // redeeming features:
    // 1. It does not allocate any memory.
    // 2. It is order O(N * log(N)) in the (likely) case of no change required.
    // 3. In the case where the nested loops run many times it will be reducing
    //    the value of N for subsequent calls, thus making those subsequent
    //    calls much faster.  For example, suppose we start off with 100 ordered
    //    tokens, and one call to this method runs the outer loop 50 times to
    //    reduce the ordered set to 50 tokens.  Then the next call will start
    //    with only 50 ordered tokens.
    // There's a trade-off here, because reducing the complexity would mean
    // changing the data structures, which would add complexity to state
    // persistence, or allocating temporary storage, which would increase the
    // runtime a lot in the common case where the previously ordered common
    // tokens are present in the same order in the new tokens.
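To make the trade-off concrete, here is a much-simplified, hypothetical sketch (all names invented, not the actual ml-cpp code) of the allocation-free, nested-scan pattern the comment describes: each ordered token id is located in the new token sequence by a linear scan, and survivors are compacted in place.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Keep only those ordered token ids that still appear, in order, in the
// new token sequence. Works in place, so no memory is allocated, but the
// linear scan nested inside the outer loop is what drives up the worst
// case cost when there are many tokens per message.
void updateOrderedTokens(std::vector<std::size_t>& orderedTokenIds,
                         const std::vector<std::size_t>& newTokenIds) {
    std::size_t writePos = 0;
    std::size_t searchFrom = 0;
    for (std::size_t id : orderedTokenIds) {
        auto it = std::find(newTokenIds.begin() + searchFrom,
                            newTokenIds.end(), id);
        if (it != newTokenIds.end()) {
            searchFrom = static_cast<std::size_t>(it - newTokenIds.begin()) + 1;
            orderedTokenIds[writePos++] = id;
        }
    }
    orderedTokenIds.resize(writePos);
}
```

In the common case where the ordered tokens reappear in the same order, each `std::find` succeeds almost immediately, which is why the method is cheap when no change is required.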

Usually we get away with this, but recently we have seen some data sets where the inefficient algorithm causes problems, namely when each message contains many tokens:

[Screenshot 2022-09-07 at 13:01:25]

When m_CommonUniqueTokenIds contains more than a certain number of values, we should take the time at the beginning of CTokenListCategory::updateOrderedCommonTokenIds to build a more efficient data structure to work with. For small numbers of tokens the cost of building a new data structure (in particular the memory allocations) is likely to outweigh the cost of the looping we currently do, so some experimentation will be needed to find the point at which the switch becomes worthwhile. For large numbers of tokens, however, we should see a performance improvement.
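One way to picture the proposed change is a size-gated switch between the current linear scans and a temporary hash index. The sketch below is illustrative only (names and the threshold value are invented, and the real tuning point would need to be found by experiment, as noted above):

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical threshold: below it, linear scans avoid any allocation;
// above it, one O(N) hash-set build pays for itself via O(1) lookups.
// The right value would need to be determined experimentally.
constexpr std::size_t INDEX_THRESHOLD = 64;

std::size_t countCommonTokens(const std::vector<std::size_t>& commonUniqueTokenIds,
                              const std::vector<std::size_t>& newTokenIds) {
    if (commonUniqueTokenIds.size() < INDEX_THRESHOLD) {
        // Small case: scan the vector directly, no allocation.
        std::size_t count = 0;
        for (std::size_t id : newTokenIds) {
            if (std::find(commonUniqueTokenIds.begin(),
                          commonUniqueTokenIds.end(),
                          id) != commonUniqueTokenIds.end()) {
                ++count;
            }
        }
        return count;
    }
    // Large case: build a temporary index once, then do O(1) membership tests.
    std::unordered_set<std::size_t> index(commonUniqueTokenIds.begin(),
                                          commonUniqueTokenIds.end());
    std::size_t count = 0;
    for (std::size_t id : newTokenIds) {
        count += index.count(id);
    }
    return count;
}
```

Both branches compute the same answer; only the asymptotic cost of the membership test differs, which is exactly the trade-off the issue asks to investigate.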

droberts195 added the :ml label Sep 7, 2022
edsavage added a commit to elastic/elasticsearch that referenced this issue Sep 8, 2022
Categorization of strings which break down to a huge number of tokens can cause the C++ backend process to choke - see elastic/ml-cpp#2403.

This PR adds a limit filter to the default categorization analyzer which caps the number of tokens passed to the backend at 100.

Unfortunately this isn't a panacea for all the issues surrounding categorization of many-tokened / large messages, as verification checks on the frontend can also fail due to calls to the datafeed _preview API returning an excessive amount of data.
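For context, Elasticsearch's `limit` token filter caps the number of tokens an analyzer emits via its `max_token_count` parameter. A categorization analyzer override along these lines (illustrative only, not the PR's exact diff) would cap the tokens reaching the backend:

```json
{
  "categorization_analyzer": {
    "tokenizer": "ml_standard",
    "filter": [
      { "type": "limit", "max_token_count": 100 }
    ]
  }
}
```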