Inertia in k-means clustering #594

frederik-f · 2023-12-05T17:01:43Z


Dear all,

I'm using the k-means algorithm to cluster a set of curves. Does anybody know exactly how the inertia is calculated in this case? Suppose I have the grid_points [0, 1, 2, 3] and the data_matrix [[1, 2, 3, 4], [6, 5, 4, 3], [5, 3, 1, -1]]. With 2 clusters I get, for example, cluster_centers = [[5.5, 1],[4, 2],[2.5, 3],[1, 4]] and an inertia of 10.5, which is the sum of the squared distances. The individual distances are [0., 2.29128785, 2.29128785]. I would really appreciate it if anybody could tell me exactly how these distances are calculated, ideally for this specific example.

Thanks a lot in advance!


vnmabus · 2023-12-07T22:04:32Z

Maintainer

In your case you have the discretization grid $\mathbf{t} = (0, 1, 2, 3)$ and the functions $x_0, x_1, x_2$ with $x_0(\mathbf{t}) = (1, 2, 3, 4)$, $x_1(\mathbf{t}) = (6, 5, 4, 3)$ and $x_2(\mathbf{t}) = (5, 3, 1, -1)$.
The cluster centers are $c_0, c_1$ with $c_0(\mathbf{t}) = (5.5, 4, 2.5, 1)$ and $c_1(\mathbf{t}) = (1, 2, 3, 4)$.
The distance between two functions $f$ and $g$ is configurable using the `metric` parameter. By default it is the $L^2$ distance, that is, $d(f, g) = \sqrt{\int_{\mathcal{T}} |f(t)-g(t)|^2 \, dt}$ (thus analogous to the Euclidean distance).
The distances appearing in the inertia are those between each observation and the center of the cluster it belongs to (the closest one). Thus:
$d_0 = d(x_0, c_1) = 0$ (because both are equal)
$d_1 = d(x_1, c_0) \approx 2.29128785$
$d_2 = d(x_2, c_0) \approx 2.29128785$
In all cases the integral is approximated using a Simpson quadrature rule, as implemented in `scipy.integrate.simpson`.
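To make the numbers concrete, here is a minimal sketch in plain Python that reproduces the distances and the inertia for this example. It is not the library's code: the helper names `simpson_38` and `l2_distance` are hypothetical, and the integral is computed with Simpson's 3/8 rule for four equally spaced points as a stand-in for `scipy.integrate.simpson`. For this data the sampled squared differences lie on a parabola, which the 3/8 rule integrates exactly, so each nonzero squared distance comes out as 5.25 and the inertia as $0 + 5.25 + 5.25 = 10.5$.

```python
import math

# Data from the question: a shared grid and three sampled curves.
grid_points = [0.0, 1.0, 2.0, 3.0]
curves = [
    [1.0, 2.0, 3.0, 4.0],    # x_0
    [6.0, 5.0, 4.0, 3.0],    # x_1
    [5.0, 3.0, 1.0, -1.0],   # x_2
]
# Cluster centers as written in the answer.
centers = [
    [5.5, 4.0, 2.5, 1.0],    # c_0
    [1.0, 2.0, 3.0, 4.0],    # c_1
]


def simpson_38(values, grid):
    """Simpson's 3/8 rule for exactly four equally spaced samples.

    Assumption: a stand-in quadrature for this example only, not the
    routine the library calls.
    """
    if len(values) != 4 or len(grid) != 4:
        raise ValueError("this sketch only handles four samples")
    h = grid[1] - grid[0]
    f0, f1, f2, f3 = values
    return 3.0 * h / 8.0 * (f0 + 3.0 * f1 + 3.0 * f2 + f3)


def l2_distance(f, g, grid):
    """L2 distance between two curves sampled on the same grid."""
    squared_diff = [(a - b) ** 2 for a, b in zip(f, g)]
    return math.sqrt(simpson_38(squared_diff, grid))


# Each observation contributes its distance to the closest center.
distances = [
    min(l2_distance(x, c, grid_points) for c in centers) for x in curves
]
# The inertia is the sum of the squared distances.
inertia = sum(d ** 2 for d in distances)

print(distances)
print(inertia)
```

With a different grid size or a non-polynomial integrand, the result would depend on the exact quadrature used, so small differences from the library's output are to be expected in general.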


frederik-f Dec 8, 2023
Author

Thanks a lot for your quick response and detailed explanation, which fully resolved my question!
