Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: Sep 1, 2020
Copyright: © 2020 Sebastien Roch
**The setup** We are given $n$ vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ in $\mathbb{R}^d$. Our goal is to find a good clustering: loosely speaking, we want to partition these data points into $k$ disjoint subsets -- or clusters -- with small pairwise distances within clusters and large pairwise distances across clusters. To make this rather imprecise problem more precise, we consider a specific objective function known as the $k$-means objective.
Fix a number of clusters $k$. A partition of $[n] = \{1,\ldots,n\}$ of size $k$ is a collection of disjoint non-empty subsets $C_1,\ldots,C_k \subseteq [n]$ that covers all of $[n]$, that is, such that $\cup_{i=1}^k C_i = [n]$. The cost of $C_1,\ldots,C_k$ is then defined as
$$
\mathcal{G}(C_1,\ldots,C_k) = \min_{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k} \sum_{i=1}^k \sum_{j \in C_i} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2.
$$

We think of $\boldsymbol{\mu}_i$ as the representative -- or center -- of cluster $C_i$. Note that $\boldsymbol{\mu}_i$ need not be one of the $\mathbf{x}_j$'s. Our goal is to find a partition that minimizes $\mathcal{G}(C_1,\ldots,C_k)$.
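To make the objective concrete, here is a small sketch (toy points, not the iris data below) that evaluates $\mathcal{G}$ for a fixed partition, plugging in each cluster's centroid as its center:

```julia
# Evaluate the k-means objective for a fixed partition, using each
# cluster's centroid as its center (toy data for illustration only).
function kmeans_obj(X, clusters)
    G = 0.0
    for C in clusters
        mu = vec(sum(X[C, :], dims=1) ./ length(C))   # centroid of cluster C
        G += sum(sum((X[j, :] .- mu).^2) for j in C)  # squared distances to it
    end
    return G
end

X = [0.0 0.0; 0.0 1.0; 10.0 10.0; 10.0 11.0]  # four points in R^2
kmeans_obj(X, [[1, 2], [3, 4]])               # each point is 0.5 from its centroid
```

Each cluster contributes $2 \times 0.5^2 = 0.5$, so the objective value is $1.0$ here.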
The $k$-means algorithm is a popular heuristic. It is based on the idea that the following two sub-problems are easy to solve:

1. given the clusters, find the optimal representatives;
2. given the representatives, find the optimal clusters.

One then alternates between the two.
**The theory** We proved the following.
Lemma (Optimal Representatives): Fix a partition $C_1,\ldots,C_k$. The optimal representatives are the centroids
$$
\boldsymbol{\mu}_i^* = \frac{1}{|C_i|} \sum_{j\in C_i} \mathbf{x}_j.
$$

Lemma (Optimal Clustering): Fix the representatives $\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k$. An optimal partition is obtained as follows. For each $j$, find the $\boldsymbol{\mu}_i$ that minimizes $\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2$ (picking one arbitrarily in the case of ties) and assign $\mathbf{x}_j$ to $C_i$.
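A quick numerical sanity check of the Optimal Representatives lemma, on made-up points: the centroid never costs more than a perturbed center.

```julia
using LinearAlgebra

# within-cluster cost of a single cluster for a given center mu
cluster_cost(pts, mu) = sum(norm(pts[j, :] .- mu)^2 for j in 1:size(pts, 1))

pts = [1.0 2.0; 3.0 4.0; 5.0 0.0]                # toy cluster of 3 points
mu_star = vec(sum(pts, dims=1) ./ size(pts, 1))  # centroid = (3.0, 2.0)
cluster_cost(pts, mu_star) <= cluster_cost(pts, mu_star .+ [0.1, -0.2])  # true
```

In fact, moving the center by $\boldsymbol{\delta}$ increases the cost by exactly $|C_i| \|\boldsymbol{\delta}\|^2$, which is the content of the lemma.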
Theorem (Convergence of $k$-means): The sequence of objective function values produced by the $k$-means algorithm is non-increasing.
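Indeed, the theorem follows by chaining the two lemmas. Writing $C_i^{(t)}$ for the clusters after iteration $t$ and $\boldsymbol{\mu}_i^{(t)}$ for the centroid of $C_i^{(t)}$, one full iteration gives

$$
\sum_{i=1}^k \sum_{j \in C_i^{(t+1)}} \|\mathbf{x}_j - \boldsymbol{\mu}_i^{(t+1)}\|^2
\leq \sum_{i=1}^k \sum_{j \in C_i^{(t+1)}} \|\mathbf{x}_j - \boldsymbol{\mu}_i^{(t)}\|^2
\leq \sum_{i=1}^k \sum_{j \in C_i^{(t)}} \|\mathbf{x}_j - \boldsymbol{\mu}_i^{(t)}\|^2,
$$

where the second inequality holds because $C^{(t+1)}$ assigns each point to its nearest representative among the $\boldsymbol{\mu}_i^{(t)}$'s (Optimal Clustering lemma), and the first because the updated centroids are optimal for the new clusters (Optimal Representatives lemma).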
function opt_clust(X, k, reps)
    # requires LinearAlgebra for `norm`
    n, d = size(X) # n = number of rows, d = number of columns
    dist = zeros(Float64, n) # distance to nearest rep
    assign = zeros(Int64, n) # cluster assignments
    for i = 1:n
        dist[i], assign[i] = findmin([norm(X[i,:] .- reps[j,:]) for j=1:k])
    end
    @show G = sum(dist.^2)
    return assign
end
opt_clust (generic function with 1 method)
function opt_reps(X, k, assign)
    n, d = size(X)
    reps = zeros(Float64, k, d) # rows are representatives
    for j = 1:k
        in_j = [i for i=1:n if assign[i] == j]
        reps[j,:] = sum(X[in_j,:], dims=1) ./ length(in_j)
    end
    return reps
end
opt_reps (generic function with 1 method)
function mmids_kmeans(X, k; maxiter=10)
    n, d = size(X)
    assign = [rand(1:k) for i=1:n] # start with random assignments
    reps = zeros(Float64, k, d) # initialization of reps
    for iter = 1:maxiter
        # Step 1: Optimal representatives for fixed clusters
        reps = opt_reps(X, k, assign)
        # Step 2: Optimal clusters for fixed representatives
        assign = opt_clust(X, k, reps)
    end
    return assign
end
mmids_kmeans (generic function with 1 method)
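As a sanity check of the convergence theorem, here is a self-contained toy run (synthetic Gaussian blobs and a compact re-implementation of the same alternating scheme, independent of the functions above) verifying that the cost sequence is non-increasing:

```julia
using LinearAlgebra, Random

# Record the objective value after each iteration of the alternating scheme.
function toy_kmeans_costs(X, k; maxiter=10, rng=MersenneTwister(0))
    n, d = size(X)
    assign = rand(rng, 1:k, n)        # random initial assignments
    costs = Float64[]
    for _ in 1:maxiter
        # Step 1: centroid of each cluster (empty clusters get the origin)
        reps = vcat([sum(X[assign .== j, :], dims=1) ./ max(count(assign .== j), 1)
                     for j in 1:k]...)
        # Step 2: reassign each point to its nearest representative
        for i in 1:n
            assign[i] = argmin([norm(X[i, :] .- reps[j, :]) for j in 1:k])
        end
        push!(costs, sum(norm(X[i, :] .- reps[assign[i], :])^2 for i in 1:n))
    end
    return costs
end

rng = MersenneTwister(1)
X = [randn(rng, 20, 2); randn(rng, 20, 2) .+ 5.0]  # two synthetic blobs
costs = toy_kmeans_costs(X, 2)
all(diff(costs) .<= 1e-10)   # non-increasing, as the theorem guarantees
```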
We will look again at the classical iris dataset first analyzed by Fisher. We will load the data in the form of a DataFrame -- similar to a spreadsheet -- where the columns are different measurements (or features) and the rows are different samples.
df = CSV.read("iris-measurements.csv")
first(df, 5)
|   | Id    | PetalLengthCm | PetalWidthCm | SepalLengthCm | SepalWidthCm |
|---|-------|---------------|--------------|---------------|--------------|
|   | Int64 | Float64       | Float64      | Float64       | Float64      |
| 1 | 1     | 1.4           | 0.2          | 5.1           | 3.5          |
| 2 | 2     | 1.4           | 0.2          | 4.9           | 3.0          |
| 3 | 3     | 1.3           | 0.2          | 4.7           | 3.2          |
| 4 | 4     | 1.5           | 0.2          | 4.6           | 3.1          |
| 5 | 5     | 1.4           | 0.2          | 5.0           | 3.6          |
describe(df)
|   | variable      | mean    | min  | median | max  | nunique | nmissing | eltype   |
|---|---------------|---------|------|--------|------|---------|----------|----------|
|   | Symbol        | Float64 | Real | Float64 | Real | Nothing | Nothing  | DataType |
| 1 | Id            | 75.5    | 1    | 75.5   | 150  |         |          | Int64    |
| 2 | PetalLengthCm | 3.75867 | 1.0  | 4.35   | 6.9  |         |          | Float64  |
| 3 | PetalWidthCm  | 1.19867 | 0.1  | 1.3    | 2.5  |         |          | Float64  |
| 4 | SepalLengthCm | 5.84333 | 4.3  | 5.8    | 7.9  |         |          | Float64  |
| 5 | SepalWidthCm  | 3.054   | 2.0  | 3.0    | 4.4  |         |          | Float64  |
X = reduce(hcat, [df[:,:PetalLengthCm], df[:,:PetalWidthCm],
df[:,:SepalLengthCm], df[:,:SepalWidthCm]]);
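As an aside, `reduce(hcat, vs)` stitches a vector of columns into a matrix, which is what builds `X` above; a toy illustration (unrelated to the dataset):

```julia
# reduce(hcat, cols) horizontally concatenates a vector of column vectors;
# each input vector becomes one column of the resulting matrix.
cols = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
M = reduce(hcat, cols)
size(M)   # (3, 2): 3 samples (rows), 2 features (columns)
```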
scatter(X[:,1], X[:,2],
legend=false, xlabel="PetalLength", ylabel="PetalWidth")
assign = mmids_kmeans(X, 2);
G = sum(dist .^ 2) = 622.7644623999738
G = sum(dist .^ 2) = 156.29984835441667
G = sum(dist .^ 2) = 152.53336910043723
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
G = sum(dist .^ 2) = 152.36870647733903
scatter(X[:,1], X[:,2], marker_z=assign, legend=false)
assign = mmids_kmeans(X, 3; maxiter=20);
G = sum(dist .^ 2) = 543.1623859092733
G = sum(dist .^ 2) = 101.63572149600577
G = sum(dist .^ 2) = 88.35799644444447
G = sum(dist .^ 2) = 85.04157943238867
G = sum(dist .^ 2) = 84.10217888865148
G = sum(dist .^ 2) = 83.13638186876973
G = sum(dist .^ 2) = 81.8390020677262
G = sum(dist .^ 2) = 80.895776
G = sum(dist .^ 2) = 79.96297983461302
G = sum(dist .^ 2) = 79.43376414532675
G = sum(dist .^ 2) = 79.01070972222222
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
G = sum(dist .^ 2) = 78.94506582597728
p3clust = scatter(X[:,1], X[:,2],
marker_z=assign, legend=false)
df_truth = CSV.read("iris-species.csv")
first(df_truth, 5)
|   | Id    | Species     |
|---|-------|-------------|
|   | Int64 | String      |
| 1 | 1     | Iris-setosa |
| 2 | 2     | Iris-setosa |
| 3 | 3     | Iris-setosa |
| 4 | 4     | Iris-setosa |
| 5 | 5     | Iris-setosa |
species = df_truth[:,:Species]
unique(species)
3-element Array{String,1}:
 "Iris-setosa"
 "Iris-versicolor"
 "Iris-virginica"
species2number = Dict("Iris-setosa"=>1, "Iris-versicolor"=>2, "Iris-virginica"=>3)
truth = map(i -> get(species2number,i,0), species)
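The `map`/`get` pattern above converts string labels to integers, with the third argument of `get` serving as a fallback value for labels missing from the dictionary; a toy illustration with made-up labels:

```julia
# get(dict, key, default) returns dict[key] if present, else the default (0 here)
s2n = Dict("a" => 1, "b" => 2)
map(s -> get(s2n, s, 0), ["a", "b", "a", "c"])   # [1, 2, 1, 0]
```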
ptruth = scatter(X[:,1], X[:,2],
    marker_z=truth, legend=false, seriescolor=:lightrainbow, title="truth")
plot(ptruth, p3clust, layout=(1,2), size=(750,375))