Introduction
In the next few posts I want to take you on the learning journey that happened when I tried to wrap my head around this wonderful paper by Brian Kulis and Michael I. Jordan (yes, Michael Jordan). The paper is about a variant of the k-means algorithm inspired by a Bayesian way of thinking about clustering. This helps solve a significant problem with k-means, namely that we don’t know the optimal value of $k$ in advance, and, furthermore, that there often is not even a way to make a reasonable guess of what a good value might be for a particular dataset.
In this post, we will go over the contents of section 2.1 of the paper, which is about a connection between k-means and a mixture of Gaussians model. In future posts, we will introduce a Bayesian approach to clustering, and see how that perspective is more than just philosophical; it can help us develop a new algorithm that is more flexible than k-means.
If you’re not familiar with k-means, I’d recommend reading up on it before proceeding. The Wikipedia page is a good place to start.
Background
We are going to develop the k-means algorithm in a somewhat non-standard way, but first we need to review a few preliminaries.
Mixtures of Gaussians
A mixture of Gaussians model can be succinctly expressed as follows:

$$p(x) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$
This expression says that under our model, the probability assigned to sampling a point near $x$ is a mixture of the probabilities of sampling a point near $x$ according to each of $k$ different Gaussians. The mixing coefficient $\pi_i$ represents the amount of weight we place on the $i$th Gaussian (the $\pi_i$ are nonnegative and sum to 1), and $\mu_i$ and $\Sigma_i$ are the mean vectors and covariance matrices, respectively.
One way to interpret this model that I find intuitively appealing is as an assumption that each point in your dataset is generated by one of $k$ distinct (Gaussian) generating processes. To sample a new point from this distribution, you first sample a component index $i$ (with probability $\pi_i$), and then sample a point from the corresponding Gaussian. (Note that this assumption may or may not be suitable for a particular dataset or application. As George Box famously said: “All models are wrong, but some are useful.”)
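To make the generative story concrete, here is a minimal NumPy sketch of that two-stage sampling procedure (the function and variable names are mine, not the paper’s):

```python
import numpy as np

def sample_mog(pi, mu, Sigma, n, seed=0):
    """Draw n points from a mixture of Gaussians.

    pi:    (k,)      mixing weights (nonnegative, summing to 1)
    mu:    (k, d)    component means
    Sigma: (k, d, d) component covariance matrices
    """
    rng = np.random.default_rng(seed)
    # Step 1: pick a component for each point; these picks are the
    # "latent" assignments we'll reason about below.
    components = rng.choice(len(pi), size=n, p=pi)
    # Step 2: sample each point from its chosen Gaussian.
    points = np.array([rng.multivariate_normal(mu[c], Sigma[c])
                       for c in components])
    return points, components
```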
EM algorithm
The EM algorithm is a way of carrying out maximum likelihood estimation in cases where the model is constructed with latent variables. Latent variables are often used to account for or represent characteristics of a population that cannot be directly observed. As an example, someone’s political orientation might not be directly observable, but if you ask questions about certain issues, you might be able to infer it.
The algorithm tries to find an optimal pair of quantities: the values of the latent variables and the parameter values. To do this, we alternate between two steps:
Expectation step
The output of this step is a function that takes a potential parameter setting (i.e., a model) $\theta$, and outputs a score. (Higher scores are better.) The obvious question: How do we compute that score? Let’s say that $z$ is a possible latent variable setting. The score corresponding to $z$ is $\log p(x, z | \theta)$, i.e., the (log) likelihood of observing the data $x$ together with $z$ under the model $\theta$. Since a “probable” setting of $z$ depends on the parameters we’ve estimated so far and the observed data $x$, the overall score for $\theta$ is a weighted average of the scores across possible values of $z$, where the weights are the probabilities of the settings $z$ given the latest parameters and the data $x$.
To summarize using notation we’ll refer to in the next section, the output of this step is a function we’ll call $Q$. To formalize the idea in the previous paragraph, we define $Q$ as follows:

$$Q(\theta; x, \theta^{(t)}) = \sum_{z} p(z \mid x, \theta^{(t)}) \, \log p(x, z \mid \theta)$$
Here, $\theta^{(t)}$ is our running parameter estimate; the $(t)$ is a superscript index, not an exponent. At each step, we have access to a concrete value of $\theta^{(t)}$. As the algorithm progresses, we update $\theta^{(t)}$ to $\theta^{(t+1)}$ and so on. In contrast to $\theta^{(t)}$, $\theta$ is a placeholder for a parameter setting that we are constructing $Q$ to evaluate. In the same way that when we define $f(x)$, we don’t know anything about $x$ except that it’s an input to $f$, when we work with $Q(\theta; x, \theta^{(t)})$, we don’t know anything about $\theta$ except that it’s a potential parameter setting we might want to test.
The semicolon in the definition of $Q$ is just a way of indicating that the function also depends on the data $x$ and the latest parameter setting $\theta^{(t)}$, but that we’re treating these as known (whereas $\theta$ is a variable).
Maximization step
Once we have the function $Q$, we simply maximize it over $\theta$. The concrete update rule depends on the model we’re working with, but the general idea is to find the parameter setting that makes the data most likely. In general mathematical terms, we are looking for

$$\theta^{(t+1)} = \operatorname*{arg\,max}_{\theta} \, Q(\theta; x, \theta^{(t)})$$
We alternate between the expectation step (forming $Q$ using $\theta^{(t)}$) and maximization step (finding the parameter value that maximizes $Q$) until the algorithm converges.
K-means = EM + mixture of Gaussians + a limit
For each point $x_i$, we can imagine that there is a true, but latent, cluster assignment vector $z_i \in \text{one-hot}(k)$. Each entry $z_{ic}$ of $z_i$ is 1 if $x_i$ is a member of cluster $c$ and 0 otherwise. (Note that only one component of $z_i$ can be 1 for each $i$; with $k = 3$, for example, $z_i = (0, 1, 0)$ says that $x_i$ belongs to cluster 2.) In our setup, since the cluster assignment is latent, $x_i$’s cluster membership is smeared across the different clusters probabilistically. We express this idea of a soft assignment by saying that there is a probability, $\gamma(z_{ic})$, that the point $x_i$ is in cluster $c$. (In contrast, k-means makes a hard assignment, where each point is assigned to exactly one cluster, in which case $\gamma$’s output would be binary. We’ll get to that in a moment.)
Let’s assume that the $x_i$ are drawn from a mixture of Gaussians with mixture coefficients $\pi_j$ for $j = 1,\dots, k$, but with the additional assumption that for each $j$, $\Sigma_j = \sigma I$. That is, assume that the covariance matrices of the Gaussians are all diagonal and have the same value $\sigma$ on their diagonals.
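Writing out this isotropic Gaussian density shows why the assumption matters later: each component’s density depends on $x$ only through the squared distance to its mean,

$$\mathcal{N}(x \mid \mu_j, \sigma I) = \frac{1}{(2\pi\sigma)^{d/2}} \exp\!\left( -\frac{\|x - \mu_j\|^2}{2\sigma} \right),$$

where $d$ is the dimensionality of the data.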
The entries of the $\gamma(z_i)$ (the vector of $\gamma(z_{ic})$ for $c = 1, \dots, k$; I apologize for the abuse of notation) can be expressed as

$$\gamma(z_{ij}; \theta^{(t)}) = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma I)}{\sum_{c=1}^{k} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \sigma I)}$$
We write $\gamma(z_{ij}; \theta^{(t)})$ to emphasize that the $\gamma$’s are computed using the latest parameter estimates $\theta^{(t)}$.
The denominator here looks complicated, but it’s just there to make sure that the entries of $\gamma(z_i)$ sum to 1. The numerator is roughly the probability that $x_i$ was generated by the $j$th Gaussian. (The exact expression is a bit more complicated, but this is the intuition.)
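Here is a minimal NumPy sketch of this computation (again, the names are mine); working in log space and exp-normalizing keeps it numerically stable when $\sigma$ is small:

```python
import numpy as np

def responsibilities(X, mu, pi, sigma):
    """Soft assignments: gamma[i, j] = p(x_i came from Gaussian j)."""
    # Squared distances ||x_i - mu_j||^2, shape (n, k).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Log of pi_j * N(x_i | mu_j, sigma * I), dropping the (2*pi*sigma)^(-d/2)
    # factor: it is shared by every component and cancels when we normalize.
    log_gamma = np.log(pi)[None, :] - sq_dists / (2 * sigma)
    # Exp-normalize: subtracting the row max keeps exp() from underflowing.
    log_gamma -= log_gamma.max(axis=1, keepdims=True)
    gamma = np.exp(log_gamma)
    return gamma / gamma.sum(axis=1, keepdims=True)
```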
The k-means algorithm can be seen as a limiting case of the EM algorithm applied to this mixture of Gaussians setup.
The E-step
The TL;DR is that the E-step of the EM algorithm computes the $\gamma(z_i)$ for each $x_i$ using the existing $\mu_j$s and $\pi_j$s. I want to be slightly more precise, though, about how this fits within the abstract framework of the EM algorithm that we outlined earlier.
The E-step computes $\gamma(z_i)$ as above. To map this onto the abstract definition of the E-step, we can first note that for a particular example $x_i$, we can compute the (log) likelihood of $x_i$ being generated by the $j$th Gaussian as

$$\log p(x_i, z_{ij} = 1 \mid \theta) = \log\!\left( \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma I) \right)$$
Here, $\theta$ is the set of all parameters to evaluate, i.e., the $\pi_j$ and $\mu_j$. To compute the expected value of this quantity for $x_i$ over all possible latent variable settings, we can use the $\gamma(z_{ij})$ we computed above to weight the contribution of the log probability for each latent setting:

$$\mathbb{E}_{z_i}\!\left[ \log p(x_i, z_i \mid \theta) \right] = \sum_{j=1}^{k} \gamma(z_{ij}; \theta^{(t)}) \, \log\!\left( \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma I) \right)$$
To extend this expectation to the entire dataset, we can simply sum over all the $x_i$:

$$Q(\theta; x, \theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma(z_{ij}; \theta^{(t)}) \, \log\!\left( \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma I) \right) \tag{1}$$
Note that in equation (1), the $\mu_j$ and $\pi_j$ are the parameters that we are evaluating with the function $Q$, whereas the latest means and mixture coefficients we’ve estimated so far (i.e., those represented by $\theta^{(t)}$) are used to compute the $\gamma(z_{ij})$.
The M-step
Since we’ve fixed the covariances (i.e., they are not parameters that need to be estimated), the M-step updates the $\mu_j$ to be the weighted average of all of the $x_i$. Each point’s contribution to the computation of $\mu_j$ is weighted by the corresponding $\gamma(z_{ij})$, or how strongly it is attracted to $\mu_j$. This makes intuitive sense; the farther away a point is from a given mean, the (exponentially) less impact it has on that mean’s update.
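Concretely, setting the derivative of equation (1) with respect to $\mu_j$ to zero yields

$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(z_{ij}; \theta^{(t)}) \, x_i}{\sum_{i=1}^{n} \gamma(z_{ij}; \theta^{(t)})}$$

(The mixing weights also get the standard update $\pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma(z_{ij}; \theta^{(t)})$, though they will stop mattering once the assignments go hard.)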
Putting it all together
The final ingredient we need to make the rigorous connection to k-means is to note what happens when $\sigma$ gets smaller and smaller. As the $k$ Gaussians in our mixture become narrower and narrower, points that are not very close to the means have their likelihoods exponentially decay to 0. Since the sum in the denominator of the expression for $\gamma(z_{ij})$ becomes dominated by the component whose mean is closest to $x_i$, as $\sigma \to 0$, $\gamma(z_{ij})$ tends to 1 when $j$ minimizes $||x_i - \mu_j||^2$ and to 0 otherwise. This “one-hot”ing of the $\gamma(z_i)$ is the same as saying that letting $\sigma$ go to 0 turns our soft assignment into a hard assignment, which is exactly the kind of assignment k-means makes.
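In symbols: substituting the isotropic density into the definition of $\gamma(z_{ij})$ (the shared normalizing constants cancel) gives

$$\gamma(z_{ij}) = \frac{\pi_j \exp\!\left( -\|x_i - \mu_j\|^2 / (2\sigma) \right)}{\sum_{c=1}^{k} \pi_c \exp\!\left( -\|x_i - \mu_c\|^2 / (2\sigma) \right)} \;\xrightarrow{\;\sigma \to 0\;}\; \begin{cases} 1 & \text{if } j = \operatorname*{arg\,min}_c \|x_i - \mu_c\|^2, \\ 0 & \text{otherwise}, \end{cases}$$

assuming the closest mean is unique (any fixed positive $\pi_c$ are eventually overwhelmed by the exponential terms).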
To summarize with what is truly an unpleasant mouthful: The k-means algorithm can be seen as the application of the EM algorithm to a mixture of Gaussians when we assume that the covariance matrices are fixed to $\sigma I$ and we let $\sigma$ go to 0.
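If you prefer to see that mouthful in code, here is a compact sketch of the whole loop under the assumptions above, reusing the `responsibilities` function from earlier (again, all names are mine, not the paper’s). With a tiny `sigma`, the soft E-step is numerically indistinguishable from k-means’ hard assignment:

```python
import numpy as np

def em_isotropic_mog(X, k, sigma, n_iters=100, seed=0):
    """EM for a mixture of k Gaussians with covariances fixed to sigma * I.

    As sigma -> 0, this behaves like Lloyd's k-means algorithm.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]  # init means at data points
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: soft assignments gamma[i, j].
        gamma = responsibilities(X, mu, pi, sigma)
        # M-step: gamma-weighted means and mixing weights.
        weights = gamma.sum(axis=0) + 1e-12  # epsilon guards empty clusters
        mu = (gamma.T @ X) / weights[:, None]
        pi = weights / n
    return mu, gamma
```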
Conclusion
That’s all for now. I’ve always found it quite satisfying to see how seemingly disparate theorems, algorithms, models, etc. can come together in unexpected ways.
In the next post, we’ll start building up the Bayesian perspective that leads to the new clustering algorithm I mentioned in the intro. Stay tuned!