The $K$-means algorithm is a clustering algorithm in which each centroid $\mathbf{c}_{i}$ is simply the arithmetic mean of the points in its cluster $\mathcal{C}_{i}$. Accordingly, the loss to minimize is the sum of squared distances from each point $\mathbf{x}$ to its nearest centroid:
$$L(\mathcal{D})=\sum_{i=1}^{K}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\left\Vert \mathbf{x}-\mathbf{c}_{i}\right\Vert^{2}.$$
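This loss can be computed directly; here is a minimal sketch in pure Python, using squared Euclidean distances (the objective that the arithmetic-mean update actually minimizes). The function name `kmeans_loss` is our own:

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def kmeans_loss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(dist(x, c) ** 2 for c in centroids) for x in points)
```

For example, with points $(0,0)$ and $(2,0)$ and a single centroid $(1,0)$, each point contributes a squared distance of $1$, so the loss is $2$.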
The $K$-means algorithm requires only the number of clusters $K$ and initial centroids (typically chosen at random) as input. It then alternates between two steps:
- Assign to cluster $\mathcal{C}_{i}$ the points nearest to the centroid $\mathbf{c}_{i}$.
- Set each centroid $\mathbf{c}_{i}$ to the arithmetic mean of the points in cluster $\mathcal{C}_{i}$.
Below we give the complete algorithm and show an example of convergence on a toy dataset.
Inputs:
- $K$, the number of clusters.
- $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$, the initial set of centroids at step $0$.

Outputs:
- $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$, the final set of centroids.

Variables:
- $C=\{\mathcal{C}_{1},\mathcal{C}_{2},\dots,\mathcal{C}_{K}\}$, the set of sets $\mathcal{C}_{i}$ containing the points closest to $\mathbf{c}_{i}$.
until satisfied:
    for $i$ from $1$ to $K$:
        $\mathcal{C}_{i}:=\{\mathbf{x}\in\mathcal{D}\mid\mathbf{c}_{i}$ is the centroid closest to $\mathbf{x}\}$
    for $i$ from $1$ to $K$:
        $\mathbf{c}_{i}:=\frac{1}{\left|\mathcal{C}_{i}\right|}\sum_{\mathbf{x}\in\mathcal{C}_{i}}\mathbf{x}$
return $\{\mathbf{c}_{1},\mathbf{c}_{2},\dots,\mathbf{c}_{K}\}$
end
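The algorithm above can be sketched in pure Python as follows. The stopping criterion used here (centroids no longer change) is one concrete choice for "until satisfied"; the treatment of empty clusters (keep the previous centroid) is our own assumption:

```python
from math import dist        # Euclidean distance (Python >= 3.8)
from statistics import mean

def kmeans(points, centroids, n_iter=100):
    """Lloyd's iteration: alternate assignment and mean-update steps."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: cluster C_i collects the points closest to c_i.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid becomes the arithmetic mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = [tuple(mean(coords) for coords in zip(*cluster)) if cluster else centroids[i]
               for i, cluster in enumerate(clusters)]
        if new == centroids:  # "until satisfied": the centroids are a fixed point
            break
        centroids = new
    return centroids
```

On two well-separated pairs of points, e.g. $\{(0,0),(0,1)\}$ and $\{(10,0),(10,1)\}$ with initial centroids $(0,0)$ and $(10,0)$, the algorithm converges in one update to the two cluster means $(0,0.5)$ and $(10,0.5)$.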
Figure 4: Convergence of the $K$-means algorithm on a toy $2$-dimensional clustering dataset with $K=3$. The dataset is given in (a). The centroids (denoted by $\circ$) and the points closest to them (denoted by $\times$) are represented with the same color in steps 1 to 3.
Although we show an example of successful convergence, the $K$-means algorithm does not always reach a good solution: depending on the initial centroids, it can converge to a poor local minimum of the loss.
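A common remedy, not part of the algorithm above but standard practice, is to restart from several random initializations and keep the run with the lowest loss. A minimal sketch (the helpers `loss` and `best_of_restarts` are our own; initial centroids are sampled from the data points):

```python
import random
from math import dist
from statistics import mean

def kmeans(points, centroids, n_iter=100):
    # Compact Lloyd's iteration: assignment step, then mean-update step.
    centroids = list(centroids)
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for x in points:
            clusters[min(range(len(centroids)),
                         key=lambda j: dist(x, centroids[j]))].append(x)
        centroids = [tuple(mean(v) for v in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def loss(points, centroids):
    # Sum of squared distances to the nearest centroid.
    return sum(min(dist(x, c) ** 2 for c in centroids) for x in points)

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-means from several random initializations; keep the lowest-loss result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        init = rng.sample(points, k)  # k distinct data points as initial centroids
        result = kmeans(points, init)
        if best is None or loss(points, result) < loss(points, best):
            best = result
    return best
```

On the two-pairs dataset from before, an unlucky initialization such as $(0,0)$ and $(0,1)$ splits each pair across both clusters and converges to a poor local minimum, while restarts that start with one centroid near each pair find the much better solution; keeping the lowest-loss run recovers the good clustering with high probability.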
The $K$-means algorithm can be used to extract features for classification; however, probabilistic models can be much more powerful at capturing complex structure in data.