[ML] Don't try and correct for sample count when estimating statistic variances for anomaly detection #2677
base: main
Conversation
We model the level of a time series which we've observed having step discontinuities via a Markov process for forecasting. Specifically, we estimate the historical step size distribution and the distribution of steps in time and as a function of the time series value. For this second part we use an online naive Bayes model to estimate the probability that, at any given point in a forecast roll out, we will get a step. This approach generally works well unless the roll out takes us into the tails of the distribution of values we've observed for the time series historically. In that case, our predicted probabilities are very sensitive to the tail behaviour of the distributions we fit to the time series values at which we saw a step, and as a result we sometimes predict far too many steps. This case is detectable: we know when we're in the tails of the time series value distribution. This change detects it and stops predicting steps in such cases, which avoids the pathology. This fixes #2466.
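The detection described above can be sketched as follows. This is an illustrative Python sketch, not the actual ml-cpp implementation: the function name, the use of empirical quantiles, and the 1% tail fraction are all assumptions.

```python
import numpy as np

def should_predict_step(value, historical_values, tail_fraction=0.01):
    """Illustrative sketch: return False when `value` lies in the tails of
    the historical value distribution, where the tail behaviour of the
    fitted class-conditional distributions makes step predictions
    unreliable, so step prediction should be suppressed."""
    lo, hi = np.quantile(historical_values, [tail_fraction, 1.0 - tail_fraction])
    return lo <= value <= hi
```

The key design point is that suppression is conservative: in the tails we simply stop rolling out steps rather than trusting a probability estimate that is dominated by poorly constrained tail mass.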
Behaviour before the change: I use the following script to generate synthetic data:

```python
import pandas as pd
import numpy as np


def generate_variable_frequency_data():
    """Generate variable frequency throughput data with a failure scenario.

    Returns:
        pandas.DataFrame: A DataFrame containing the generated data with two columns:
            - '@timefield': Timestamps of the data points.
            - 'transaction_throughput': Throughput values at each timestamp.
    """
    # Define start and end dates
    start_date = pd.to_datetime("2024-04-01")
    end_date = pd.to_datetime("2024-04-21")  # 20 day period

    # Initialize lists to store timestamps and throughput values
    timestamps = []
    throughput_values = []

    # Initial timestamp
    current_time = start_date
    while current_time <= end_date:
        # Append the current timestamp
        timestamps.append(current_time)

        # Generate a throughput value with normal variability
        throughput = np.random.normal(200, 50)
        throughput = max(0, throughput)  # Ensure non-negative throughput
        throughput_values.append(throughput)

        # Generate the next sampling interval using a sinusoidal frequency
        # with a 24 hour period plus noise
        base_frequency = 10  # base interval in seconds
        sinusoidal_variation = 50 * np.sin(
            2 * np.pi * current_time.hour / 24
        )  # sinusoidal variation
        noise = np.random.normal(0, 5)  # noise
        interval = base_frequency + sinusoidal_variation + noise

        # Simulate a drop in frequency after a certain date
        if pd.to_datetime("2024-04-18") < current_time < pd.to_datetime("2024-04-19"):
            interval *= 25  # Stretch the interval 25x
            throughput_values[-1] = 0

        # Calculate the next timestamp
        current_time += pd.to_timedelta(abs(interval), unit="s")

    return pd.DataFrame(
        {"@timefield": timestamps, "transaction_throughput": throughput_values}
    )


if __name__ == "__main__":
    # Generate data
    data = generate_variable_frequency_data()
    # Save the data to a CSV file
    data.to_csv("variable_frequency_throughput_data.csv", index=False)
```

Hence, while the data frequency is time-dependent, the metric value

cc @tveasey
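To see what this produces, the generated points can be bucketed by hour: the per-bucket document count varies with time of day while the metric value itself stays roughly constant. This is an illustrative sketch using a miniature inline version of the data (seed, interval distribution, and sizes are assumptions, not taken from the script above).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Miniature stand-in for the generated data: irregular timestamps,
# roughly constant metric value around 200.
times = pd.to_datetime("2024-04-01") + pd.to_timedelta(
    np.cumsum(rng.exponential(30, 2000)), unit="s"
)
values = rng.normal(200, 50, size=2000)
df = pd.DataFrame({"@timefield": times, "transaction_throughput": values})

# Bucket by hour: doc_count varies, the mean does not.
hourly = df.set_index("@timefield").resample("1h")["transaction_throughput"]
summary = pd.DataFrame({"doc_count": hourly.count(), "mean": hourly.mean()})
```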
While working on elastic/ml-cpp#2677, I encountered a failure in the integration test DetectionRulesIt.testCondition(), which checks the number of returned records. With the new change in ml-cpp, the native code returns two additional records that have no significant score. I added filtering to remove these in the integration test code so it continues to work as expected.
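The filtering amounts to dropping records whose score is negligible before counting. This is a hypothetical Python sketch of the idea (the actual integration test is Java, and the field name and 0.1 threshold here are assumptions):

```python
# Records as the test might see them; the two near-zero entries stand in
# for the extra records the native code now returns.
records = [
    {"record_score": 91.3},
    {"record_score": 0.0},   # new, insignificant record
    {"record_score": 0.02},  # new, insignificant record
    {"record_score": 75.6},
]

# Keep only records with a significant score before counting them.
significant = [r for r in records if r["record_score"] > 0.1]
```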
Thanks for finishing this Valeriy! One small suggestion, but LGTM.
Currently, we try to correct for the sample count when estimating the variances of the bucket statistics we use for anomaly detection. This adds significant complexity to sampling metrics and creates a disconnect between the data we show in visualisations and the data we use for anomaly detection. Furthermore, the independence assumption frequently does not hold, in which case our current behaviour can lead to false negatives. This is particularly problematic for data where outages are associated with a significant fall in data rate. The choice to try to correct the variance predated modelling periodic variance, which now better accounts for the most common case: a data rate which is periodic.
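The false-negative mechanism can be illustrated with a toy calculation. This sketch assumes the correction scaled the variance of a bucket mean as 1/n under an i.i.d. assumption; the exact form used in ml-cpp may differ, and the function name and numbers are illustrative.

```python
def mean_variance_corrected(sigma2, n):
    """Illustrative sketch: variance of a bucket mean assuming n i.i.d.
    samples within the bucket (the kind of correction being removed)."""
    return sigma2 / max(n, 1)

# Normal bucket: 100 samples; outage bucket: data rate collapses to 2 samples.
typical = mean_variance_corrected(2500.0, 100)  # 25.0
outage = mean_variance_corrected(2500.0, 2)     # 1250.0
# The assumed variance inflates 50x during the outage, so a genuinely
# anomalous bucket mean can look unremarkable: a false negative.
```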
In this PR I have reverted to using the raw time bucket statistics for model updates and anomaly detection. I rely on periodic variance estimation to deal with (common instances of) a time-varying data rate. This is a step towards #1386.