
[ML] Don't try and correct for sample count when estimating statistic variances for anomaly detection #2677

Draft · wants to merge 20 commits into main
Conversation

@tveasey (Contributor) commented Jun 3, 2024

Currently, we:

  1. Try to ensure each metric sample contains the same number of values
  2. Correct the variance when computing time bucket anomaly scores, assuming the raw values are independent samples from some distribution

This adds significant complexity to sampling metrics and creates a disconnect between the data we show in visualisations and the data we use for anomaly detection. Furthermore, the independence assumption frequently does not hold, and in that case our current behaviour can lead to false negatives. This is particularly problematic for data where outages are associated with a significant fall in data rate. The choice to correct the variance predates modelling periodic variance, which now better accounts for the most common case: a periodic data rate.

In this PR I have reverted to using the raw time bucket statistics for model update and anomaly detection. I rely on periodic variance estimation to deal with (common instances of) time-varying data rate. This is a step towards #1386.
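To illustrate why the old correction misbehaves, here is a minimal Python sketch (purely illustrative, not the ml-cpp implementation; the AR(1) process and bucket sizes are assumptions) comparing the raw bucket variance with the sigma^2 / n correction for a healthy bucket and an outage bucket:

import numpy as np

rng = np.random.default_rng(0)


def ar1(n, mean, rho=0.9, sigma=10.0):
    # AR(1) series: values within a bucket are correlated, not iid.
    x = np.empty(n)
    x[0] = mean
    for i in range(1, n):
        x[i] = mean + rho * (x[i - 1] - mean) + rng.normal(0.0, sigma)
    return x


normal_bucket = ar1(600, mean=200.0)  # healthy bucket: many samples
outage_bucket = ar1(20, mean=50.0)    # outage: level drop, far fewer samples

for name, bucket in (("normal", normal_bucket), ("outage", outage_bucket)):
    n = len(bucket)
    raw_var = bucket.var()        # raw bucket statistic (behaviour after this PR)
    iid_corrected = raw_var / n   # sigma^2 / n correction (behaviour before)
    print(f"{name:7s} n={n:4d} raw_var={raw_var:9.2f} corrected={iid_corrected:9.4f}")

# Because the correction scales with the per-bucket sample count, a bucket
# whose data rate collapses during an outage is assigned much larger
# uncertainty than a healthy bucket, diluting its anomaly score (a false
# negative); and with correlated values sigma^2 / n is wrong anyway.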

For forecasting, we model the level of a time series in which we've observed step discontinuities via a Markov process. Specifically, we estimate the historical step size distribution and the distribution of the steps in time and as a function of the time series value. For this second part we use an online naive Bayes model to estimate the probability that at any given point in a forecast roll out we will get a step.
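A compact sketch of this sort of online naive Bayes estimator (hypothetical Python class names and feature choices; the actual model lives in the C++ code and may differ):

import math


class OnlineGaussian:
    """Running mean/variance (Welford) for one naive Bayes feature."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def log_pdf(self, x):
        var = self.m2 / max(self.n - 1, 1) + 1e-8
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


class StepProbabilityModel:
    """Naive Bayes over (time-of-day, series value) for 'step' vs 'no step'."""

    def __init__(self):
        self.features = {c: (OnlineGaussian(), OnlineGaussian()) for c in (0, 1)}
        self.counts = {0: 0, 1: 0}

    def update(self, time_of_day, value, stepped):
        c = int(stepped)
        self.counts[c] += 1
        self.features[c][0].add(time_of_day)
        self.features[c][1].add(value)

    def step_probability(self, time_of_day, value):
        total = sum(self.counts.values())
        logp = {}
        for c in (0, 1):
            prior = math.log((self.counts[c] + 1) / (total + 2))  # Laplace smoothing
            logp[c] = (prior + self.features[c][0].log_pdf(time_of_day)
                             + self.features[c][1].log_pdf(value))
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return math.exp(logp[1] - m) / z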

This approach generally works well unless, when we roll out, we're in the tails of the values we've observed historically for the time series. In that case our predicted step probabilities are very sensitive to the tail behaviour of the distributions we fit to the time series values where we saw a step, and we sometimes predict far too many steps as a result. We can detect this case: it occurs when we're in the tails of the time series value distribution.

This change detects that case and stops predicting steps there, which avoids these pathologies. This fixes #2466.
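One way to express that guard, reusing the OnlineGaussian from the sketch above (the 1% tail threshold and all names are assumptions, not the C++ implementation):

import math


def in_tails(value, gaussian, eps=0.01):
    # True when value lies outside the central 1 - 2 * eps probability mass
    # of the fitted (normal) value distribution.
    var = gaussian.m2 / max(gaussian.n - 1, 1) + 1e-8
    z = (value - gaussian.mean) / math.sqrt(var)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf < eps or cdf > 1.0 - eps


# During a forecast roll out (hypothetical names):
#   if in_tails(predicted_value, value_distribution):
#       p_step = 0.0  # the fitted step model is unreliable out here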
@tveasey tveasey changed the title Don't try and correct for sample count when estimating statistic variances for anomaly detection [ML] Don't try and correct for sample count when estimating statistic variances for anomaly detection Jun 3, 2024
@valeriy42 valeriy42 self-assigned this Jun 17, 2024
@valeriy42 valeriy42 marked this pull request as draft June 20, 2024 08:04
@valeriy42 valeriy42 added the ci:run-qa-tests Run a subset of the QA tests label Jun 20, 2024
@valeriy42 valeriy42 added v8.16.0 and removed v8.15.0 labels Jul 1, 2024
@valeriy42 (Contributor) commented Jul 4, 2024

Behaviour before the change: [image]

Behaviour after the change: [image]

I use the following script to generate synthetic data:

import pandas as pd
import numpy as np


def generate_variable_frequency_data():
    """Generate variable frequency throughput data with a failure scenario.

    Returns:
        pandas.DataFrame: A DataFrame containing the generated data with two columns:
            - '@timefield': Timestamps of the data points.
            - 'transaction_throughput': Throughput values at each timestamp.
    """
    # Define start and end dates
    start_date = pd.to_datetime("2024-04-01")
    end_date = pd.to_datetime("2024-04-21")  # 20 days period

    # Initialize lists to store timestamps and throughput values
    timestamps = []
    throughput_values = []

    # Initial timestamp
    current_time = start_date

    while current_time <= end_date:
        # Append the current timestamp
        timestamps.append(current_time)

        # Generate a throughput value with normal variability
        throughput = np.random.normal(200, 50)
        throughput = max(0, throughput)  # Ensure non-negative throughput
        throughput_values.append(throughput)

        # Generate the next timestamp using a sinusoidally varying interval (24-hour period) plus noise
        base_frequency = 10  # base frequency in seconds
        sinusoidal_variation = 50 * np.sin(
            2 * np.pi * current_time.hour / 24
        )  # sinusoidal variation
        noise = np.random.normal(0, 5)  # noise
        interval = base_frequency + sinusoidal_variation + noise

        # Simulate a drop in frequency after a certain date
        if current_time > pd.to_datetime(
            "2024-04-18"
        ) and current_time < pd.to_datetime("2024-04-19"):
            interval *= 25  # Scale the interval up by 25x (a ~2400% increase)
            throughput_values[-1] = 0
        # Calculate the next timestamp
        current_time += pd.to_timedelta(abs(interval), unit="s")

    return pd.DataFrame(
        {"@timefield": timestamps, "transaction_throughput": throughput_values}
    )


if __name__ == "__main__":

    # Generate data
    data = generate_variable_frequency_data()

    # Save the data to a CSV file
    data.to_csv("variable_frequency_throughput_data.csv", index=False)

Hence, while the data frequency is time-dependent, the metric value transaction_throughput is itself normally distributed. After the change its confidence interval is correctly estimated, whereas before the change the confidence bounds changed over time, which is wrong.
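For reference, one can inspect the raw per-bucket statistics the detector now consumes by aggregating the generated CSV into fixed time buckets (a sketch; the 15-minute bucket span is an assumption):

import pandas as pd

df = pd.read_csv("variable_frequency_throughput_data.csv",
                 parse_dates=["@timefield"])

# Raw time bucket statistics: count, mean and variance per bucket, with
# no sample-count correction applied afterwards.
buckets = (df.set_index("@timefield")["transaction_throughput"]
             .resample("15min")
             .agg(["count", "mean", "var"]))
print(buckets.head())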

cc @tveasey

@valeriy42 valeriy42 removed the WIP label Jul 4, 2024
@valeriy42 valeriy42 marked this pull request as ready for review July 8, 2024 09:57
valeriy42 added a commit to elastic/elasticsearch that referenced this pull request Jul 9, 2024
While working on elastic/ml-cpp#2677, I encountered a failure in the integration test DetectionRulesIt.testCondition(). It checks the number of returned records. With the new change in ml-cpp, the native code returns two more values that have no significant score. I added filtering them out in the integration test code so it continues to work as expected.
@tveasey (Contributor, Author) left a comment:

Thanks for finishing this, Valeriy! One small suggestion, but LGTM.

Review comment on include/core/CSmallVector.h (outdated, resolved).
@valeriy42 valeriy42 marked this pull request as draft September 10, 2024 08:38