Ask the question
I'm training a model using the Tribuo HDBSCAN algorithm and then predicting new values with this model to search for anomalies in my data. However, when retrieving the prediction scores, I'm getting back values that are clearly wrong.
To make myself clear, I'm training the model on segments containing statistics of my data (standard deviation, mean, and so on). Each segment refers to a certain timeframe (10 AM to 11 AM, 11 AM to 12 PM, and so on).
When predicting, I do the same with the test set: grouping it into the same timeframes and computing the same statistics used during the training phase.
Still, after calling the predict method, even though the test dataset has a much higher magnitude than the data used to train the model, the scores don't reflect the distance between the datasets. I would have expected these test values to get the maximum outlier score, since they are so far from the training data.
Is there something wrong with my approach?
This code shows how we created the DataSource for the training and test sets.
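(The code block referenced above did not survive extraction.) As a rough, hypothetical sketch of the segment-statistics step described in the question — grouping timestamped values into hourly timeframes and computing the per-segment mean and standard deviation that become the clustering features — here is a plain-Java, stdlib-only illustration. It is not the actual Tribuo `DataSource` code; class and variable names are invented:

```java
import java.util.*;

public class SegmentStats {
    // Compute mean and (population) standard deviation for one timeframe's values.
    static double[] stats(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = values.stream().mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        return new double[] { mean, Math.sqrt(var) };
    }

    public static void main(String[] args) {
        // Hypothetical raw data: (hourOfDay, value) pairs, grouped into hourly segments.
        double[][] raw = { {10, 5.0}, {10, 7.0}, {11, 50.0}, {11, 54.0} };
        Map<Integer, List<Double>> segments = new TreeMap<>();
        for (double[] r : raw) {
            segments.computeIfAbsent((int) r[0], h -> new ArrayList<>()).add(r[1]);
        }
        // Each segment's (mean, std) pair would become one example's features.
        for (Map.Entry<Integer, List<Double>> e : segments.entrySet()) {
            double[] s = stats(e.getValue());
            System.out.printf("%d-%d: mean=%.1f std=%.1f%n", e.getKey(), e.getKey() + 1, s[0], s[1]);
        }
    }
}
```

In a real Tribuo pipeline these feature vectors would be wrapped in examples inside a custom `DataSource`, along the lines of the `GaussianClusterDataSource` class linked below.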
Is the issue that at prediction time some examples which are far from the training data are being assigned to a non-noise cluster? Or are they being assigned to the noise cluster, but have strange outlier scores? I think the outlier scores are fixed and based on the largest MST edge weight, so they might not be too useful currently.
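One way to sanity-check this independently of the model's outlier scores is to measure each test point's distance to its nearest training point: a point far from all training data should intuitively be flagged as an outlier even if the fixed score does not reflect that. A minimal, hypothetical sketch (invented data; plain Java, not Tribuo API):

```java
import java.util.*;

public class NearestTrainDistance {
    // Euclidean distance between two feature vectors.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Distance from point p to its nearest neighbour in the training set.
    static double nearest(double[][] train, double[] p) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] t : train) best = Math.min(best, dist(t, p));
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-segment (mean, std) feature vectors from training.
        double[][] train = { {6.0, 1.0}, {5.5, 1.2}, {6.2, 0.9} };
        // A test segment with much higher magnitude.
        double[] testPoint = { 52.0, 2.0 };
        // A large nearest-neighbour distance suggests the point is an outlier,
        // whatever score the clustering model assigns it.
        System.out.println(nearest(train, testPoint));
    }
}
```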
To define our custom DataSource, we took this class as an example: https://github.com/oracle/tribuo/blob/407af05654dabdeed06c4439333db89bae6cc9d9/Clustering/Core/src/main/java/org/tribuo/clustering/example/GaussianClusterDataSource.java
One of our doubts is the assignment of the clusterID for each segment. Here is an image showing what the DataSource instance (before training) contains in debugger mode:
Is your question about a specific ML algorithm or approach?
I'm using the HDBScan algorithm
Is your question about a specific Tribuo class?
HDBScanModel and Dataset<ClusterID>