Optimize feature selection for clustering (or in general finding attractors) #85
Hey, great! I think it's a cool project, which could extend to finding relevant features for time series in general. Our experience from using DBSCAN to find attractors is that it can be quite unclear when the clustering works. We mainly had to rely on the silhouette index, which is also far from perfect, and quite unintuitive. Another complication is that finding the most efficient set of features via the clustering would be very expensive if it means sampling lots of feature combinations. So maybe it would be nice to come up with a technique that doesn't rely on clustering like DBSCAN.

In the systems I've been studying, it hasn't been tremendously difficult to find nice features that are constant, or at least stationary, on the attractor, so that their time averages were well behaved and initial conditions on the same attractor gave basically the same values. In this scenario, feature space is composed of very densely populated clouds (almost single points) that are all well separated from each other, and a simple histogram (or the similar idea I did in PR #84) works nicely. Maybe a similar idea could work here: one could look at the histograms of individual features and from there decide on the best ones. I'm not sure what the best metric would be, though. I'm also not sure in general how robust this would be to noisy data.
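A minimal sketch of the histogram idea above, assuming each feature is just a vector of values across initial conditions; the helper `n_clouds` and the choice of 50 bins are illustrative, not part of any package:

```julia
using StatsBase

# Count how many well-separated "clouds" a single feature produces, by
# histogramming it and counting maximal runs of occupied bins.
# Purely illustrative; the bin count would need tuning per dataset.
function n_clouds(feature::AbstractVector; nbins = 50)
    h = fit(Histogram, feature; nbins)
    occupied = h.weights .> 0
    # a "cloud" starts wherever an occupied bin follows an empty one (or the edge)
    return count(i -> occupied[i] && (i == 1 || !occupied[i - 1]),
                 eachindex(occupied))
end
```

A feature for which `n_clouds` matches the expected number of attractors, with narrow clouds, would then be a good candidate.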
I wonder how well the clustering would work if applied to all the available features simultaneously, i.e. on this high-dimensional feature space. If it works well (the histogram would, I guess), then it could be taken as a ground truth against which to compare the single-feature analysis.

Some additional thoughts:

- Be aware of features spanning different orders of magnitude, especially if you are computing distances. Features in the interval (0, 1e3) will over-contribute compared to features in (0, 1e-1), for instance. A problem we haven't yet solved is how to properly deal with this (one simple option is sketched below).
- A useful property of the features that could enter the metric is whether they are constant, or at least stationary, in time.

Would be happy to follow this and discuss more at some point!

PS: sorry for the delay, I had a trip this week.
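On the orders-of-magnitude point, a minimal sketch of one common workaround (per-feature min-max rescaling); `rescale_features` is a hypothetical helper, and since the comment above says the problem is not yet solved, this is only one possible option:

```julia
# Rescale each feature (row of an n_features × n_samples matrix) to [0, 1],
# so that features spanning different orders of magnitude contribute
# comparably to distance computations. Hypothetical helper, not Attractors.jl API.
function rescale_features(features::AbstractMatrix)
    lo = minimum(features; dims = 2)
    hi = maximum(features; dims = 2)
    return (features .- lo) ./ max.(hi .- lo, eps())  # eps() guards constant features
end
```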
I guess you want to use some integer programming technique. The setup is as follows: for 20 features you have a vector of 20 values in {0,1}, where 1 means "use this feature" and 0 means "discard this one". The program searches for the {0,1} vector that optimizes the fitness function (the clustering metric). It will select the features that optimize the metric. The hard part is defining the right clustering quality metric.
The optimization to select the best features can be done with DBSCAN on a small random sample of feature vectors. The optimization shouldn't take long if the problem is convex. If it is not convex, that is the worst-case scenario and you may end up with a suboptimal solution.
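A minimal sketch of this binary-mask search, assuming `features` is an n_features × n_samples matrix, a fixed DBSCAN `radius`, and the mean silhouette from Clustering.jl as a placeholder quality metric; the function name and all of these choices are assumptions, not part of Attractors.jl:

```julia
using Clustering, Distances, Statistics

# Exhaustively score every non-empty {0,1} feature mask by clustering the
# selected rows with DBSCAN and taking the mean silhouette as the fitness.
# Enumeration grows as 2^n_features, so this is only viable for few features;
# a smarter (e.g. integer-programming) search would be needed beyond that.
function best_feature_mask(features::AbstractMatrix, radius::Real)
    nf = size(features, 1)
    best_score, best_mask = -Inf, falses(nf)
    for bits in Iterators.product(ntuple(_ -> (false, true), nf)...)
        mask = collect(bits)
        any(mask) || continue                      # skip the empty mask
        X = features[mask, :]
        a = dbscan(X, radius).assignments          # 0 marks unclustered points
        keep = a .> 0
        length(unique(a[keep])) >= 2 || continue   # silhouettes need ≥ 2 clusters
        d = pairwise(Euclidean(), X[:, keep]; dims = 2)
        score = mean(silhouettes(a[keep], d))
        if score > best_score
            best_score, best_mask = score, BitVector(mask)
        end
    end
    return best_mask, best_score
end
```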
In my application this is not a problem. I have about 100 trajectories in total.
What do you mean? Why?
But I haven't really understood from your answer a way to judge the quality of a clustering, so there is no way to compare whether some features are good or not. You say "compare" above, but how? Technically, how? As I said in my original post, I need a way to numerically estimate the "goodness" of a clustering output. I guess I'll open a post on Discourse and see if someone from statistics has an idea of how to do this. Alex put it very well:
Yes, but this is the easy part. I can just use Combinatorics.jl to make an iterator for me.
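For reference, a minimal sketch of such an iterator with Combinatorics.jl; `nf` and the loop body are placeholders:

```julia
using Combinatorics

nf = 5                              # hypothetical number of available features
for idxs in combinations(1:nf)      # all non-empty subsets, of every size
    # idxs is e.g. [1], [2], ..., [1, 2], ..., [1, 2, 3, 4, 5];
    # evaluate the clustering quality metric on features[idxs, :] here
end
```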
@KalelR apparently the metric I am searching for is just the silhouette (see the post above). But I remember that here we optimize differently, using Optim.jl. So I guess the numerical quantity I am searching for is just the minimum of the optimizer here: https://github.com/JuliaDynamics/Attractors.jl/blob/main/src/mapping/grouping/cluster_optimal_%CF%B5.jl#L116 So I guess I should just do a simple PR that makes it so that this information is stored somewhere in the
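Not the actual Attractors.jl internals, just a hedged illustration of how that quantity can be read off when the ε optimization goes through Optim.jl: the result object exposes both the minimizer and the attained minimum.

```julia
using Optim

# Illustrative only, not the Attractors.jl code: minimize a clustering cost
# over the DBSCAN radius ε on a bracketing interval (Brent's method).
clustering_cost(ε) = (ε - 0.3)^2     # stand-in for e.g. minus the mean silhouette
ε_min, ε_max = 0.0, 1.0

result   = Optim.optimize(clustering_cost, ε_min, ε_max)
best_ε   = Optim.minimizer(result)   # the optimal radius
best_val = Optim.minimum(result)     # the numerical quantity to store/report
```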
I did some analysis here. I am keeping track of the optimal silhouette value from the ε optimization. It appears to me that just this by itself is not enough. Here are some examples from the data I am using (left are the timeseries; the features are always the standard deviation of the timeseries for now; the dataset has 3 attractors, which we know a priori; we want to see which observables, i.e. "features", separate the attractors best):

[figure: optimal val = 0.84]
[figure: optimal val = 0.87]
[figure: optimal val = 0.95; this image is pretty interesting because it shows that only one feature is sufficient to cluster this data into 3 attractors (you will notice that these 3 features are almost perfectly linearly correlated)]

My results are that the optimal silhouette value indeed correlates with "cluster separation quality"; however, this metric by itself is probably not enough. E.g., in the following example I get the lowest optimal value, yet still find 3 clusters:

[figure: optimal value = 0.8]

My main difficulty is that the silhouette quality metric has so little variability in it. I need to process this. Ideally I want a metric that rewards both cluster quality and cluster amount, and that also rewards using fewer feature dimensions. I want to test whether the formula "optimal value * number of attractors / number of feature dimensions" is enough.
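A minimal sketch of the combined metric proposed at the end of the comment above; the three inputs are assumed to be computed elsewhere, and the formula itself is only the candidate to be tested, nothing validated:

```julia
# Candidate metric: reward silhouette quality and number of attractors found,
# penalize the number of feature dimensions used. Purely a proposal to test.
cluster_score(opt_silhouette, n_attractors, n_features) =
    opt_silhouette * n_attractors / n_features

cluster_score(0.8, 3, 1)   # the single-feature example above: 0.8 * 3 / 1 = 2.4
```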
I have an idea of how we could make a function to optimize feature selection for clustering attractors. I will be creating this function while working on a new research project. But I think it is useful to discuss possibilities of how to make it happen.
My idea is as follows:
The main question therefore is how we define a "cluster quality metric". On one hand, this metric needs to have a contribution from cluster separation: the more separated the clusters are, the clearer the clustering. This could be done by computing the pairwise distribution distances between the pdfs fitted to each cluster, and keeping the median of this distribution of distances as the metric (see the sketch below).
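A minimal sketch of that idea for a single scalar feature, assuming cluster labels are already available; fitting a Normal pdf per cluster and using a symmetrised KL divergence is just one concrete choice, and `cluster_separation` is a hypothetical helper:

```julia
using Distributions, Combinatorics, Statistics

# Fit a pdf to the feature values of each cluster and return the median of
# the pairwise (symmetrised KL) distances between the fitted pdfs.
# Assumes at least two clusters; labels are positive integers.
function cluster_separation(feature::AbstractVector, labels::AbstractVector{<:Integer})
    pdfs  = [fit(Normal, feature[labels .== k]) for k in sort(unique(labels))]
    dists = [kldivergence(p, q) + kldivergence(q, p)
             for (p, q) in combinations(pdfs, 2)]
    return median(dists)
end
```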
However, this cluster quality metric should also have some sort of contribution from the number of clusters found. The more clusters the better...?
cc also @KalelR I think this is relevant for you too.