<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine-Learning on Jack Gindi</title><link>https://www.jgindi.me/tags/machine-learning/</link><description>Recent content in Machine-Learning on Jack Gindi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 26 Nov 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jgindi.me/tags/machine-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Why the sigmoid?</title><link>https://www.jgindi.me/posts/2025-11-26-logistic-regression/</link><pubDate>Wed, 26 Nov 2025 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2025-11-26-logistic-regression/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Logistic regression is often the first classification algorithm a machine learning engineer encounters. It is the workhorse of binary classification; it&amp;rsquo;s simple, interpretable, and surprisingly effective. If you have taken a standard introductory course, you have likely been presented with the model definition as a fait accompli: take a linear predictor, wrap it in a sigmoid function, and voila, you have a probability.&lt;/p>
&lt;p>But why? Why the sigmoid function specifically? Why not use a clipped straight line, or a different S-shaped curve like the cumulative normal distribution? Is it just an arbitrary choice that happens to work well, or is there something fundamental about the mathematics that demands this specific shape?&lt;/p>
&lt;h1 id="logistic-regression">Logistic regression&lt;/h1>
&lt;p>For various reasons we will not cover here, directly using linear regression for classification does not work well. If $x$ is $d+1$-dimensional, and $\beta = (\beta_0, \dots, \beta_{d})$ are model parameters that we adjust during training, then we can take our linear model $\beta^\top x$ and force it to only take values between 0 and 1 by pushing it through the sigmoid function. In other words, we model the probability that $y$ is 1 given features $x$ as
&lt;/p>
$$
p(y=1|x) \approx \frac{1}{1 + \exp(-\beta^\top x)} = \sigma(\beta^\top x).
$$&lt;p>
The sigmoid function $\sigma$ has an output range between 0 and 1, but there have to be other such functions, right? Why use this one and not some other one? In the remainder of this post, we look at two ways to understand why the choice of sigmoid is deeper than it seems.&lt;/p>
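&lt;p>Before moving on, here is a minimal Python sketch (the parameter values are made up) confirming that $\sigma$ maps any real-valued linear predictor into a valid probability strictly between 0 and 1:&lt;/p>

```python
import math

def sigmoid(z):
    # Numerically stable sigmoid: 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The linear predictor beta^T x can be any real number;
# the sigmoid squashes it into (0, 1).
beta = [0.5, -1.2]   # hypothetical parameters; beta[0] is the intercept
x = [1.0, 2.0]       # features, with a leading 1 for the intercept
z = sum(b * xi for b, xi in zip(beta, x))
p = sigmoid(z)       # a valid probability, no matter how extreme z is
```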
&lt;h1 id="range-mapping">Range mapping&lt;/h1>
&lt;p>(For this section, we will refer to $p(y=1|x)$ as $p$.)&lt;/p>
&lt;p>Assuming we want to leverage linear regression as much as possible, we want to find something that we can model using $\beta^\top x$. The most obvious thing we could try first is to model the probability directly:
&lt;/p>
$$
p \approx \beta^\top x.
$$&lt;p>One simple reason this doesn&amp;rsquo;t work is just that the output ranges don&amp;rsquo;t match. To put a finer point on it, since $p$ is a probability, it can only take values between 0 and 1, whereas $\beta^\top x$ can take any real value. The plot below helps to visually understand our problem:&lt;/p>
&lt;img src="https://www.jgindi.me/posts/logistic-regression/line_bad.png">
&lt;p>Because of this, we need to find something else we can model with $\beta^\top x$.&lt;/p>
&lt;p>Instead of modeling $p$, what if we modeled $p/(1-p)$, a quantity known as the odds? Things get a little bit better here, since the odds range from 0 (when $p$ is very small) to $+\infty$ (when $p\approx 1$). However, remember that $\beta^\top x$ can take &lt;em>any&lt;/em> real value, including negative values all the way to $-\infty$! So while this is progress, it still doesn&amp;rsquo;t quite work.&lt;/p>
&lt;p>One thing to notice about the odds is that they cover the domain of the $\log$ function. So what if we modeled the log odds $\log(p / (1-p))$ instead of the odds? In this case, the ranges match! When the odds are close to 0, the log odds approach $-\infty$. As the odds approach $\infty$, so do the log odds! At this point, our model is
&lt;/p>
$$
\log\left( \frac{p}{1-p} \right) \approx \beta^\top x
$$&lt;p>We can solve this for $p$ using some simple algebra to see that this is equivalent to a model for $p$. To do this, first we exponentiate both sides
&lt;/p>
$$
\frac{p}{1-p} \approx \exp(\beta^\top x)
$$&lt;p>
Next, we multiply both sides by $1-p$ and distribute on the right.
&lt;/p>
$$
p \approx \exp(\beta^\top x) - p\exp(\beta^\top x)
$$&lt;p>
Next, we move the $p$ terms to the left side and pull them out to get
&lt;/p>
$$
p (1 + \exp(\beta^\top x)) \approx \exp(\beta^\top x)
$$&lt;p>
Finally, dividing by $1 + \exp(\beta^\top x)$, we have
&lt;/p>
$$
p \approx \frac{\exp(\beta^\top x)}{1 + \exp(\beta^\top x)}.
$$&lt;p>
We can see that this is actually the sigmoid by dividing the numerator and denominator by $\exp(\beta^\top x)$, which gives us the final form of our model for $p$:
&lt;/p>
$$
p \approx \frac{\exp(\beta^\top x) / \exp(\beta^\top x)}{1 / \exp(\beta^\top x) + \exp(\beta^\top x) / \exp(\beta^\top x)} = \frac{1}{\exp(-\beta^\top x) + 1} = \sigma(\beta^\top x).
$$&lt;p>
(Remember that $1/\exp(\beta^\top x) = \exp(-\beta^\top x)$.)&lt;/p>
&lt;p>This derivation is intuitive and visually satisfying. It solves the &amp;ldquo;unbounded output&amp;rdquo; problem by mapping the infinite range of a linear model to the unit interval of a probability. However, a skeptical reader might still ask: &amp;ldquo;Why did we choose to model the log-odds specifically?&amp;rdquo;&lt;/p>
&lt;p>While the log-odds are a convenient choice for range mapping, they are not the only choice. We could have chosen other functions to map $(-\infty, \infty)$ to $[0,1]$ (such as the cumulative distribution of a Gaussian). To understand why the sigmoid is not just a convenient choice, but the mathematically &amp;ldquo;natural&amp;rdquo; choice, we need to dig a bit deeper.&lt;/p>
&lt;h1 id="the-exponential-family">The exponential family&lt;/h1>
&lt;p>Another, perhaps more principled way to arrive at the sigmoid function is to make an assumption about the conditional distribution of the target $Y$ given the input $X$. If we assume that
&lt;/p>
$$
Y|x \sim \text{Bern}(p(\beta^\top x)),
$$&lt;p>
i.e., that given a value for $x$ and parameters $\beta$, $Y$ is a Bernoulli random variable with success probability $p(\beta^\top x)$, then we can write the probability mass function for this distribution
&lt;/p>
$$
P(Y|X=x; p(\beta^\top x)) = \begin{cases}
p(\beta^\top x) &amp;\text{ if } Y=1 \\
1 - p(\beta^\top x) &amp;\text{ if } Y=0
\end{cases}
$$&lt;p>
From this point forward, we will make our notation less cumbersome and refer to $p(\beta^\top x)$ simply as $p$.&lt;/p>
&lt;p>There&amp;rsquo;s a more compact, cleverer way of writing $P(Y|X=x; p)$ that packs both cases from the previous formulation into a single expression:
&lt;/p>
$$
P(Y=y|X=x; p) = p^y (1 - p)^{(1-y)}.
$$&lt;p>
The key is that $y$, which is binary, acts as a switch that turns on the relevant term for each case. When $y = 1$, $1-y=0$, so $P(Y=1|X=x; p)$ resolves to $p$. Similarly, when $y = 0$, the expression resolves to $1 - p$.&lt;/p>
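&lt;p>A quick numerical check of this switch (a sketch; the helper name is mine):&lt;/p>

```python
def bernoulli_pmf(y, p):
    # Compact Bernoulli pmf: p^y * (1 - p)^(1 - y).
    return p ** y * (1 - p) ** (1 - y)

# For y = 1 the expression resolves to p; for y = 0 it resolves to 1 - p.
p = 0.3
case_one = bernoulli_pmf(1, p)   # equals p
case_zero = bernoulli_pmf(0, p)  # equals 1 - p
```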
&lt;p>Let&amp;rsquo;s first rewrite this expression:
&lt;/p>
$$
\begin{align*}
P(Y=y|X=x; p) &amp;= p^y (1 - p)^{(1-y)} \\
&amp;= \exp(\log(p^y (1 - p)^{(1-y)})) \\
&amp;= \exp(y\log(p) + (1 - y)\log(1 - p)) \\
&amp;= \exp(y\log(p) + \log(1 - p) -y\log(1 - p)) \\
&amp;= \exp((\log(p) - \log(1 - p))y + \log(1 - p)) \\
&amp;= \exp(y\log(p/(1 - p)) + \log(1 - p))
\end{align*}
$$&lt;p>This is great! But why go through this derivation?&lt;/p>
&lt;p>It reveals that the log-odds are what is known as the &lt;strong>natural parameter&lt;/strong> of the Bernoulli distribution. By setting our linear predictor $\beta^\top x$ equal to this natural parameter, we are using what is called the &lt;strong>canonical link function&lt;/strong>. This specific choice is mathematically &amp;ldquo;safe&amp;rdquo;: it guarantees that the log-likelihood function with respect to the parameters $\beta$ is concave. In practical terms, this makes the likelihood much easier to maximize, since there are no local optima to get stuck in. To see the model we get for $p$ when we model the log odds as $\beta^\top x$, just revisit the earlier section on range mapping.&lt;/p>
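&lt;p>To see the concavity claim pay off in practice, here is a minimal sketch in plain Python (the synthetic data and learning rate are my own choices): it maximizes the Bernoulli log-likelihood by plain gradient ascent, and because the objective is concave, no clever initialization or restarts are needed.&lt;/p>

```python
import math
import random

random.seed(0)

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

# Synthetic 1-d data generated from "true" parameters beta = (-1, 2).
data = []
for _ in range(500):
    x = random.uniform(-3, 3)
    y = 1 if sigmoid(-1 + 2 * x) > random.random() else 0
    data.append((x, y))

# Gradient ascent on the concave log-likelihood. The gradient with
# respect to beta_j is sum_i (y_i - sigmoid(beta^T x_i)) * x_ij.
b0, b1 = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    g0 = g1 = 0.0
    for x, y in data:
        err = y - sigmoid(b0 + b1 * x)
        g0 += err
        g1 += err * x
    b0 += lr * g0 / len(data)
    b1 += lr * g1 / len(data)
```

&lt;p>With enough iterations, the recovered $(\beta_0, \beta_1)$ should land close to the values used to generate the data.&lt;/p>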
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we&amp;rsquo;ve seen two ways of motivating the use of the sigmoid function in logistic regression. The first is easier to grasp and more intuitive, but the second gives a glimpse of the mathematical depth that is often skipped over when engineers study ML for the first time. I hope you enjoyed, and happy Thanksgiving!&lt;/p></description></item><item><title>Self-supervised learning</title><link>https://www.jgindi.me/posts/2024-07-17-self-supervision/</link><pubDate>Wed, 17 Jul 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-07-17-self-supervision/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Many of the most powerful models you see out there today are built using a two-stage approach:&lt;/p>
&lt;ol>
&lt;li>Pre-training: A model learns generally useful representations (e.g., of language or images) without reference to any particular task (e.g., sentiment analysis or image classification).&lt;/li>
&lt;li>Fine-tuning: Using the pre-trained model as a starting point, we fine-tune the model to accomplish some additional objective, such as safety or the ability to perform some downstream task.&lt;/li>
&lt;/ol>
&lt;p>In theory, the benefit of this approach is that once you have a powerful pre-trained model, you can build many different task-specific and/or fine-tuned models on top of it with relatively little additional effort. But this benefit is only realized if our pre-trained model has comprehensively discovered useful features of language, images, or both, which requires TONS of diverse, carefully curated data. From a machine learning perspective, this presents some thorny questions:&lt;/p>
&lt;ol>
&lt;li>It&amp;rsquo;s one thing to train a model for something specific, like classifying X-ray images of tissue as cancerous or not, but how do we set up a training objective that helps a model learn&amp;hellip; well&amp;hellip; very general and woefully underspecified &amp;ldquo;useful features of languages, images, or both&amp;rdquo;?&lt;/li>
&lt;li>The most widely used and successful traditional machine learning approaches have in the past required data that has been labeled, or annotated with correct outputs. But if we want to unlock the use of larger and larger datasets for training, obtaining sufficient high-quality labels can be prohibitively expensive, if it&amp;rsquo;s possible at all. Can we accomplish our pre-training without hand-labeled data?&lt;/li>
&lt;li>Even assuming we could get whatever quantity of labeled data we needed, what kinds of labels would be useful in gaining a general understanding of language or visual data?&lt;/li>
&lt;/ol>
&lt;p>One paradigm that has recently come to the fore is called self-supervision, wherein instead of hand-labeling each example, we generate the training signal from the raw data itself. To understand the significance of self-supervision, let&amp;rsquo;s first revisit the foundations of fully supervised learning. This will set the stage for appreciating how self-supervised approaches build upon and extend these principles to overcome the challenges of labeled data scarcity.&lt;/p>
&lt;h1 id="supervised-learning">Supervised learning&lt;/h1>
&lt;p>From thirty-five thousand feet, supervised learning works like this: We have a dataset of $(x, y)$ pairs, where the $x$s are called examples, and the $y$s are called labels. For instance, $x$ might be an image and $y$ might be a binary label that is 1 if the image contains a cat and 0 otherwise. In a supervised setting, during training, the model makes a cat/non-cat prediction for an image $x$, and then that prediction is checked against the label $y$. The closer the prediction &amp;ndash; which is usually a probabilistic score rather than binary &amp;ndash; is to the label, the less our parameters need to move in response to this $(x, y)$ pair.&lt;/p>
&lt;p>The key here &amp;ndash; and the problem &amp;ndash; is that for the above to work, we need a $y$ for every $x$! As noted earlier, depending on the problem, labels can be very expensive to generate. For object recognition tasks, for example, we need every object of interest to have been outlined and labeled in every image&amp;hellip; across hundreds of thousands or millions of images and across hundreds or thousands of different object categories! At scale, this is both time consuming and hard to do correctly.&lt;/p>
&lt;p>To make the jump over to self-supervision, we need to come up with useful objectives or tasks that models could use during training that only require $x$s. As a first example of how we can generate a training signal from unlabeled data, let&amp;rsquo;s talk about autoencoders.&lt;/p>
&lt;h1 id="autoencoders">Autoencoders&lt;/h1>
&lt;p>An autoencoder&amp;rsquo;s job is to take some big input thing and make it smaller. That is, we use autoencoders to take some complex object and turn it into a (relatively) low-dimensional vector of numbers (an embedding) that contains most of its relevant semantic content and is neural-network-compatible. Assuming the representation captures important information, these representations can then be used for downstream tasks like finding farms in satellite photos, or finding duplicate photos in a photo library.&lt;/p>
&lt;p>We do this compression in the following way. Let&amp;rsquo;s say you want to compress an object $x$. The autoencoder has two components:&lt;/p>
&lt;ol>
&lt;li>An encoder $E$: maps from our high-dimensional input space into our embedding space.&lt;/li>
&lt;li>A decoder $D$: this piece maps from the embedding space back into the input space.&lt;/li>
&lt;/ol>
&lt;p>(Note: Both $E$ and $D$ are smaller neural networks whose parameters are updated during training. Once the system has been trained, we typically throw out $D$ and just use $E$ to generate embeddings.)&lt;/p>
&lt;p>During training, the model is shown many examples $x$. For each one, it computes $e = E(x)$, and then $x' = D(e)$. Here, $e$ is what we call the &lt;em>embedding&lt;/em> of the object. With autoencoders, we use the original $x$ as our &amp;ldquo;label&amp;rdquo;! The way we determine whether or not our embedding $e$ is good is by whether it contains enough information to be re-expanded back into $x$! Mathematically, our representation is good if $\text{distance}(x, x')$ is small.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/self-supervision/autoencoder.png" alt="drawing" width="550" height="300"/>
&lt;p>This is our first example of actually manufacturing a supervised problem from unlabeled data by cleverly defining our objective. In this case, we define our training objective to be that our reproductions of inputs (images, for instance) should be as close as possible to the original inputs themselves. By using this training task, we obviate the need for labels, but are nonetheless able to learn something useful.&lt;/p>
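&lt;p>The training loop just described can be sketched in a few lines of plain Python. This is a deliberately tiny toy of my own construction (a &lt;em>linear&lt;/em> encoder and decoder on 2-d points that lie on a line, so a 1-d embedding suffices), not a realistic architecture:&lt;/p>

```python
import random

random.seed(0)

# Toy dataset: 2-d points lying on a line, so a 1-d embedding can capture them.
data = [(t, 2 * t) for t in (random.uniform(-1, 1) for _ in range(100))]

# Linear encoder E(x) = w . x and linear decoder D(e) = (v[0]*e, v[1]*e).
w = [0.1, 0.1]
v = [0.1, 0.1]
lr = 0.01

def reconstruction_loss(x):
    e = w[0] * x[0] + w[1] * x[1]            # embedding
    r0 = x[0] - v[0] * e                     # reconstruction residuals
    r1 = x[1] - v[1] * e
    return r0 ** 2 + r1 ** 2

for _ in range(500):                         # epochs of plain SGD
    for x in data:
        e = w[0] * x[0] + w[1] * x[1]
        r0 = x[0] - v[0] * e
        r1 = x[1] - v[1] * e
        # Gradients of the squared reconstruction error.
        common = r0 * v[0] + r1 * v[1]
        w[0] += lr * 2 * common * x[0]
        w[1] += lr * 2 * common * x[1]
        v[0] += lr * 2 * r0 * e
        v[1] += lr * 2 * r1 * e

avg_loss = sum(reconstruction_loss(x) for x in data) / len(data)
```

&lt;p>After training, the average $\text{distance}(x, x')$ should be close to zero, meaning the 1-d embedding retains essentially all of the information in this (very compressible) data.&lt;/p>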
&lt;p>Next, we&amp;rsquo;ll look at a pair of training tasks used in conjunction to train a model called BERT, a precursor to many of the large language models available today.&lt;/p>
&lt;h1 id="masked-language-modeling-mlm-and-next-sentence-prediction-nsp">Masked language modeling (MLM) and next sentence prediction (NSP)&lt;/h1>
&lt;p>In 2018, Google released a now-famous paper detailing a transformer-based language model called &lt;a href="https://arxiv.org/pdf/1810.04805">BERT&lt;/a>. What is interesting about it for our purposes is the authors&amp;rsquo; choice of training tasks. Here, again, as we&amp;rsquo;ll discuss, they found ways to generate a helpful training signal from the raw data itself, which in this case is text data. We will discuss each of their two tasks in turn: masked language modeling and next sentence prediction.&lt;/p>
&lt;h2 id="masked-language-modeling-mlm">Masked language modeling (MLM)&lt;/h2>
&lt;p>For this task, given a piece of text like&lt;/p>
&lt;blockquote>
&lt;p>The quick brown fox jumped over the lazy dog.&lt;/p>&lt;/blockquote>
&lt;p>we randomly mask a small fraction of the tokens to obtain something like&lt;/p>
&lt;blockquote>
&lt;p>The quick &lt;strong>[MASK]&lt;/strong> fox jumped over the &lt;strong>[MASK]&lt;/strong> dog.&lt;/p>&lt;/blockquote>
&lt;p>After doing its best to take the surrounding context into account, the model makes a prediction about which tokens the [MASK] tokens obscure. To be a little bit more precise, a prediction here is not just a single token like &lt;code>monkey&lt;/code> or &lt;code>brown&lt;/code>; it is actually a probability distribution over all possible tokens. In other words, for the first [MASK] token, the model might output something like&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Word&lt;/th>
&lt;th>Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>brown&lt;/td>
&lt;td>0.07&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>monkey&lt;/td>
&lt;td>0.002&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>excavator&lt;/td>
&lt;td>0.05&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&amp;hellip;&lt;/td>
&lt;td>&amp;hellip;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>(In reality, the vocabulary size over which the distribution is constructed is much larger than 3; it is often on the order of 10s or 100s of thousands.)&lt;/p>
&lt;p>To quantify how well we&amp;rsquo;re doing at any point during training, we look at the distance between the output distribution (like the one in the table) and the &amp;ldquo;correct&amp;rdquo; distribution, where the token that was actually masked is assigned probability 1 and all other tokens are assigned probability 0.&lt;/p>
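&lt;p>The &amp;ldquo;distance&amp;rdquo; typically used here is cross-entropy, which for a one-hot target collapses to the negative log of the probability the model assigned to the correct token. A small illustrative sketch (the distribution below is made up):&lt;/p>

```python
import math

# Toy output distribution over a tiny vocabulary for one [MASK] position.
# Real vocabularies have tens of thousands of entries.
predicted = {"brown": 0.7, "monkey": 0.0001, "excavator": 0.025, "other": 0.2749}

def masked_token_loss(predicted, correct_token):
    # Cross-entropy against a one-hot target reduces to
    # -log(probability assigned to the token that was actually masked).
    return -math.log(predicted[correct_token])

confident_and_right = masked_token_loss(predicted, "brown")   # small loss
confident_and_wrong = masked_token_loss(predicted, "monkey")  # large loss
```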
&lt;p>As training progresses, with enough diverse data, the model will gradually produce spikier distributions, i.e., distributions where the probability of the right answer gets closer and closer to 1, and all other options tend to 0. Further along in training, the distribution I wrote just above might be something like:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Word&lt;/th>
&lt;th>Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>brown&lt;/td>
&lt;td>0.7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>monkey&lt;/td>
&lt;td>0.0001&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>excavator&lt;/td>
&lt;td>0.025&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&amp;hellip;&lt;/td>
&lt;td>&amp;hellip;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The BERT authors use this as one of a pair of tasks that help the model learn nontrivial statistical properties of the unlabeled training text. While masked language modeling helps the model grasp contextual relationships within a sentence, understanding the relationship between different sentences is equally crucial. This brings us to the next task used in BERT&amp;rsquo;s training: next sentence prediction.&lt;/p>
&lt;h2 id="next-sentence-prediction">Next sentence prediction&lt;/h2>
&lt;p>In addition to an ability to represent individual words well, the BERT authors also wanted the model&amp;rsquo;s representations to be suited to tasks that depend on an understanding of relationships &lt;strong>between&lt;/strong> two pieces of text, which is not captured by the MLM task. To mix in an emphasis on this capability, the authors added another task called next sentence prediction (NSP), which works as follows:&lt;/p>
&lt;ol>
&lt;li>Select a sentence A from the training corpus.&lt;/li>
&lt;li>Select a sentence B from the training corpus. With a probability of 0.5, B is the sentence immediately following A, and with a probability of 0.5, B is some other, randomly selected sentence from the corpus.&lt;/li>
&lt;li>If B comes after A, this (A, B) pair is labeled with &lt;code>IsNext&lt;/code>. If B is random, the pair is labeled &lt;code>NotNext&lt;/code>.&lt;/li>
&lt;/ol>
&lt;p>During training, the model tries to learn to predict the correct label for each pair.&lt;/p>
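&lt;p>The pair-construction recipe above is easy to express in code. Here is a sketch with a made-up four-sentence &amp;ldquo;corpus&amp;rdquo;:&lt;/p>

```python
import random

random.seed(0)

# A toy ordered corpus; in BERT the pairs are drawn from real documents.
corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog did not seem to mind.",
    "It went back to sleep.",
    "Tomatoes are technically fruit.",
]

def make_nsp_pair(corpus):
    # Pick sentence A (not the last one, so a true successor always exists).
    i = random.randrange(len(corpus) - 1)
    a = corpus[i]
    if random.random() >= 0.5:
        return (a, corpus[i + 1], "IsNext")
    # Otherwise pair A with a random sentence that is not its true successor.
    others = [s for j, s in enumerate(corpus) if j != i + 1]
    return (a, random.choice(others), "NotNext")

pairs = [make_nsp_pair(corpus) for _ in range(10)]
```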
&lt;p>(Note for the slightly more technically inclined reader: If you think about how we would actually do this binary classification, it&amp;rsquo;s not so straightforward. What we actually do is we set aside a special token, usually designated [CLS]. Each transformer layer mixes together the representations of the tokens output by the previous layer, so after the input has made its way through those layers, this [CLS] token has contextual information from the actual sentence tokens mixed into its final representation. The embedding of this [CLS] token is then fed to a binary classifier.)&lt;/p>
&lt;p>During BERT&amp;rsquo;s training, the authors actually use both MLM and NSP and combine the error signals from each of them to update the model parameters. One interesting thing to note here is that these tasks don&amp;rsquo;t target any concrete feature of language, such as telling the difference between nouns and verbs, or predicting verb tenses correctly. Rather, we specify fuzzy concepts like being able to guess words from context and being able to tell when one sentence immediately follows another, and trust that the model will have to learn important language features in order to do those things well.&lt;/p>
&lt;p>With both MLM and NSP, we&amp;rsquo;ve seen how self-supervision can uncover rich linguistic features. The versatility of self-supervision is also evident in the domain of computer vision, where contrastive learning techniques have proven highly effective. Let&amp;rsquo;s delve into one common contrastive learning training objective and its applications in generating powerful image embeddings.&lt;/p>
&lt;h1 id="contrastive-learning">Contrastive learning&lt;/h1>
&lt;p>Suppose that we want to train a model $M$ to produce generic image embeddings. One way we can do this is as follows. First, we produce a triplet of embeddings:&lt;/p>
&lt;ol>
&lt;li>Select an image $x_1$ and compute its embedding $e_1 = M(x_1)$ .&lt;/li>
&lt;li>Apply an augmentation to $x_1$ and compute the embedding of the augmented image $e_1' = M(x_1')$. (There are a wide variety of augmentations we can apply, including crops, rotations, or color inversions.)&lt;/li>
&lt;li>Choose another random image $x_2$ from the dataset and compute its embedding $e_2 = M(x_2)$.&lt;/li>
&lt;/ol>
&lt;p>Contrastive objectives try to push embeddings of similar objects (e.g., an image and its crop) closer together, while pushing embeddings of unrelated objects (e.g., two random images) apart. We can set up an objective to do this using our three embeddings as follows:&lt;/p>
$$
\mathcal L(e_1, e_1', e_2; M) = \max(\text{distance}(e_1, e_1') - \text{distance}(e_1, e_2) + \alpha, 0)
$$&lt;p>Let&amp;rsquo;s have a think about what&amp;rsquo;s going on here. The function $\mathcal L$ takes positive values when
&lt;/p>
$$
\text{distance}(e_1, e_1') - \text{distance}(e_1, e_2) + \alpha > 0,
$$&lt;p>that is, unless the distance to the unrelated embedding $e_2$ exceeds the distance to the similar embedding $e_1'$ by at least the margin $\alpha$ (which we choose). When the loss is positive, the model needs to be adjusted: we update $M$&amp;rsquo;s parameters in proportion to how much each contributed to the loss.&lt;/p>
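&lt;p>Here is this triplet objective in a few lines of plain Python, with hypothetical 3-d embeddings (in practice the embeddings come from $M$ and have hundreds of dimensions):&lt;/p>

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def triplet_loss(e1, e1_aug, e2, alpha=0.2):
    # max(distance(anchor, similar) - distance(anchor, unrelated) + alpha, 0)
    return max(euclidean(e1, e1_aug) - euclidean(e1, e2) + alpha, 0.0)

# Hypothetical embeddings: the augmented view is close, the random image far.
e1, e1_aug, e2 = [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 1.0]
good = triplet_loss(e1, e1_aug, e2)  # 0: the margin is already satisfied
bad = triplet_loss(e1, e2, e1_aug)   # positive: the model would be adjusted
```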
&lt;p>With a large enough dataset of diverse images and enough training, this leads to image embeddings that can be applied to a range of downstream applications and tasks.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>The shift from relying solely on traditional supervised learning to pre-training using self-supervision marks a significant evolution in machine learning, offering solutions to the challenges of scale imposed by the cost of obtaining high-quality labeled data. By creatively designing training tasks that derive supervision from the data itself, we can harness the power of large, unlabeled datasets to pre-train models that can then be fine-tuned for specific applications.&lt;/p>
&lt;p>From autoencoders and masked language modeling to next sentence prediction and contrastive learning, each method reveals both the ingenuity and surprising effectiveness of simple techniques in developing versatile and powerful models. This paradigm not only streamlines the development process but also paves the way for more adaptable and scalable AI systems.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>The connection between k-means and Gaussian mixtures</title><link>https://www.jgindi.me/posts/km-gmm/</link><pubDate>Thu, 04 Apr 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/km-gmm/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In the next few posts I want to take you on the learning journey that happened when I tried to wrap my head around &lt;a href="https://icml.cc/2012/papers/291.pdf">this wonderful paper&lt;/a> by Brian Kulis and Michael I. Jordan (yes, Michael Jordan). The paper is about a variant of the k-means algorithm inspired by a Bayesian way of thinking about clustering. This helps solve a significant problem with k-means, namely that we don&amp;rsquo;t know the optimal value of $k$ in advance, and, furthermore, that there often is not even a way to make a reasonable guess of what a good value might be for a particular dataset.&lt;/p>
&lt;p>In this post, we will go over the contents of section 2.1 of the paper, which is about a connection between k-means and a mixture of Gaussians model. In future posts, we will introduce a Bayesian approach to clustering, and see how that perspective is more than just philosophical; it can help us develop a new algorithm that is more flexible than k-means.&lt;/p>
&lt;p>If you&amp;rsquo;re not familiar with k-means, I&amp;rsquo;d recommend reading up on it before proceeding. The &lt;a href="https://en.wikipedia.org/wiki/K-means_clustering">Wikipedia page&lt;/a> is a good place to start.&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>We are going to develop the k-means algorithm in a somewhat non-standard way, but first we need to review a few preliminaries.&lt;/p>
&lt;h2 id="mixtures-of-gaussians">Mixtures of Gaussians&lt;/h2>
&lt;p>A mixture of Gaussians model can be succinctly expressed as follows:&lt;/p>
$$
\begin{align*}
p(x) = \sum_{i=1}^k \pi_i N(x | \mu_i, \Sigma_i).
\end{align*}
$$&lt;p>This expression says that under our model, the probability assigned to sampling a point near $x$ is a mixture of the probabilities of sampling a point near $x$ according to each of $k$ different Gaussians. The mixing coefficient $\pi_i$ represents the amount of weight we place on the $i$th Gaussian (the coefficients are nonnegative and sum to 1), and $\mu_i$ and $\Sigma_i$ are the component means (vectors) and covariances (matrices), respectively.&lt;/p>
&lt;p>One way to interpret this model that I find intuitively appealing is as an assumption that each point in your dataset is generated from one of $k$ distinct (Gaussian) generating processes. To sample a new point from this distribution, you can first sample a value $i$ using the $\pi_i$, and then subsequently sample a point from the corresponding Gaussian. (Note that this assumption may or may not be suitable for a particular dataset or application. As George Box has famously said: &amp;ldquo;All models are wrong, but some are useful.&amp;rdquo;)&lt;/p>
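&lt;p>That two-stage sampling procedure looks like this in plain Python (a 1-d sketch with made-up parameters):&lt;/p>

```python
import random

random.seed(0)

# A hypothetical 1-d mixture of three Gaussians.
pis    = [0.5, 0.3, 0.2]   # mixing coefficients (nonnegative, sum to 1)
mus    = [-4.0, 0.0, 5.0]  # component means
sigmas = [1.0, 0.5, 2.0]   # component standard deviations

def sample_from_mixture():
    # First pick a component according to the mixing weights...
    i = random.choices(range(len(pis)), weights=pis)[0]
    # ...then sample a point from that component's Gaussian.
    return random.gauss(mus[i], sigmas[i])

samples = [sample_from_mixture() for _ in range(1000)]
```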
&lt;h2 id="em-algorithm">EM algorithm&lt;/h2>
&lt;p>The EM algorithm is a way of carrying out &lt;a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation&lt;/a> in cases where the model is constructed with &lt;strong>latent variables&lt;/strong>. Latent variables are often used to account for or represent characteristics of a population that cannot be directly observed. As an example, someone&amp;rsquo;s political orientation might not be directly observable, but if you ask questions about certain issues, you might be able to infer it.&lt;/p>
&lt;p>The algorithm tries to find an optimal pair of quantities: the values of the latent variables and the parameter values. To do this, we alternate between two steps:&lt;/p>
&lt;h3 id="expectation-step">Expectation step&lt;/h3>
&lt;p>The output of this step is a &lt;em>function&lt;/em> that takes a potential parameter setting (i.e., a model) $\theta$, and outputs a score. (Higher scores are better.) The obvious question: How do we compute that score? Let&amp;rsquo;s say that $z$ is a possible latent variable setting. The score corresponding to $z$ is $\log p(x, z | \theta)$, i.e., the (log) likelihood of observing the data $x$ and $z$ assuming the model $\theta$. Since a &amp;ldquo;probable&amp;rdquo; setting of $z$ depends on the parameters we&amp;rsquo;ve estimated so far and the observed data $x$, the overall score for $\theta$ is a weighted average of the scores across possible values of $z$, where the weights are the probabilities of each $z$ given the latest parameters and the data $x$.&lt;/p>
&lt;p>To summarize using notation we&amp;rsquo;ll refer to in the next section, the output of this step is a function we&amp;rsquo;ll call $Q$. To formalize the idea in the previous paragraph, we define $Q$ as follows:&lt;/p>
$$
\begin{align*}
Q(\theta; x, \theta^{(t)}) = E_{z \sim p(\cdot | x, \theta^{(t)})} [\log p(x, z | \theta)].
\end{align*}
$$&lt;p>Here, $\theta^{(t)}$ is our running parameter estimate; the $(t)$ is a superscript index, not an exponent. At each step, we have access to a concrete value of $\theta^{(t)}$. As the algorithm progresses, we update $\theta^{(t)}$ to $\theta^{(t+1)}$ and so on. In contrast to $\theta^{(t)}$, $\theta$ is a placeholder for a parameter setting &lt;em>that we are constructing $Q$ to evaluate&lt;/em>. In the same way that when we define $f(x)$, we don&amp;rsquo;t know anything about $x$ except that it&amp;rsquo;s an input to $f$, similarly, when we work with $Q(\theta; x, \theta^{(t)})$, we don&amp;rsquo;t know anything about $\theta$ except that it&amp;rsquo;s a potential parameter setting we might want to test.&lt;/p>
&lt;p>The semicolon in the definition of $Q$ is just a way of indicating that the function also depends on the data $x$ and the latest parameter setting $\theta^{(t)}$, but that we&amp;rsquo;re treating these as known (whereas $\theta$ is a variable).&lt;/p>
&lt;h3 id="maximization-step">Maximization step&lt;/h3>
&lt;p>Once we have the function $Q$, we simply maximize the function over $\theta$. The concrete update rule depends on the model we&amp;rsquo;re working with, but the general idea is to find the parameter setting that makes the data most likely. In general mathematical terms we are looking for&lt;/p>
$$
\begin{align*}
\theta^{(t+1)} = \arg\max_{\theta} Q(\theta; x, \theta^{(t)}).
\end{align*}
$$&lt;p>We alternate between the expectation step (forming $Q$ using $\theta^{(t)}$) and maximization step (finding the parameter value that maximizes $Q$) until the algorithm converges.&lt;/p>
&lt;h1 id="k-means--em--mixture-of-gaussians--a-limit">K-means = EM + mixture of Gaussians + a limit&lt;/h1>
&lt;p>For each point $x_i$, we can imagine that there is a true, but latent, cluster assignment vector $z_i \in \text{one-hot}(k)$. Each entry $z_{ic}$ of $z_i$ is 1 if $x_i$ is a member of cluster $c$ and 0 otherwise. (Note that only one component of $z_i$ can be 1 for each $i$.) In our setup, since the cluster assignment is latent, $x_i$&amp;rsquo;s cluster membership is smeared across the different clusters probabilistically. We express this idea of a &lt;strong>soft assignment&lt;/strong> by saying that there is a probability, $\gamma(z_{ic})$, that the point $x_i$ is in cluster $c$. (In contrast, k-means makes a &lt;strong>hard assignment&lt;/strong>, where each point is assigned to &lt;em>exactly one&lt;/em> cluster, in which case $\gamma$&amp;rsquo;s output would be binary. We&amp;rsquo;ll get to that in a moment.)&lt;/p>
&lt;p>Let&amp;rsquo;s assume that the $x_i$ are drawn from a mixture of Gaussians with mixture coefficients $\pi_j$ for $j = 1,\dots, k$, but with the additional assumption that for each $j$, $\Sigma_j = \sigma I$. That is, assume that the covariance matrices for each Gaussian are all diagonal and have the same value $\sigma$ on their diagonals.&lt;/p>
&lt;p>The entries of the $\gamma(z_i)$ (the vector of $\gamma(z_{ic})$ for $c = 1, \dots, k$) can be expressed as&lt;/p>
$$
\begin{align*}
\gamma(z_{ij}; \theta^{(t)}) &amp;= \frac{\pi_j N(x_i | \mu_j, \sigma I)}{\sum_{l=1}^k \pi_l N(x_i | \mu_l, \sigma I)} \\
&amp;= \frac{\pi_j \exp\left(-\frac{1}{2\sigma} \|x_i - \mu_j\|^2\right)}{\sum_{l=1}^k \pi_l \exp\left(-\frac{1}{2\sigma} \|x_i - \mu_l\|^2\right)}.
\end{align*}
$$&lt;p>We write $\gamma(z_{ij}; \theta^{(t)})$ to emphasize that the $\gamma$&amp;rsquo;s are computed using the latest parameter estimates $\theta^{(t)}$.&lt;/p>
&lt;p>The denominator here looks complicated, but it&amp;rsquo;s just there to make sure that the entries of $\gamma(z_i)$ sum to 1. The numerator is roughly the probability that $x_i$ was generated by the $j$th Gaussian. (The exact expression is a bit more complicated, but this is the intuition.)&lt;/p>
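&lt;p>A minimal numpy sketch of this responsibility computation (the function and variable names are mine, not standard; it assumes the shared scalar variance $\sigma$ and works in the log domain for numerical stability):&lt;/p>

```python
import numpy as np

def responsibilities(X, mus, pis, sigma):
    """E-step: soft assignments gamma(z_ij) for the isotropic mixture.

    X: (n, d) data, mus: (k, d) means, pis: (k,) mixture weights,
    sigma: shared scalar variance. Returns an (n, k) array whose rows
    sum to 1.
    """
    # Squared distances ||x_i - mu_j||^2, shape (n, k).
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    # Unnormalized log-responsibilities; subtracting the row max keeps
    # the exponentials from underflowing when sigma is small.
    log_w = np.log(pis)[None, :] - sq / (2 * sigma)
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)
```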
&lt;p>The k-means algorithm can be seen as a limiting case of the EM algorithm applied to this mixture of Gaussians setup.&lt;/p>
&lt;h2 id="the-e-step">The E-step&lt;/h2>
&lt;p>The TL;DR is that the E-step of the EM algorithm computes the $\gamma(z_i)$ for each $x_i$ using the existing $\mu_j$s and $\pi_j$s. I want to be slightly more precise, though, about how this fits within the abstract framework of the EM algorithm that we outlined earlier.&lt;/p>
&lt;p>The E-step computes $\gamma(z_i)$ as above. To map this onto the abstract definition of the E-step, we can first note that for a particular example $x_i$, we can compute the (log) likelihood of $x_i$ being generated by the $j$th Gaussian as&lt;/p>
$$
\log p(x_i, z_{ij} = 1 | \theta) = \log \pi_j N(x_i | \mu_j, \sigma I).
$$&lt;p>Here, $\theta$ is the set of all parameters to evaluate, i.e., the $\pi_j$ and $\mu_j$. To compute the expected value of this quantity for $x_i$ over all possible latent variable settings, we can use the $\gamma(z_{ij})$ we computed above to weight the contribution of the log probability for each latent setting $z$:&lt;/p>
$$
\begin{align*}
Q(\theta; x_i, \theta^{(t)}) &amp;= E_{z \sim p(\cdot | x_i, \theta^{(t)})} [\log p(x_i, z | \theta)] \\
&amp;= \sum_{j=1}^k \gamma(z_{ij}) \log \pi_j N(x_i | \mu_j, \sigma I).
\end{align*}
$$&lt;p>To extend this expectation to the entire dataset, we can simply sum over all the $x_i$:&lt;/p>
$$
Q(\theta; x, \theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^k \gamma(z_{ij}; \theta^{(t)}) \log \pi_j N(x_i | \mu_j, \sigma I).
$$&lt;p>Note that in the equation above, the $\mu_j$ and $\pi_j$ are the parameters that we are evaluating with the function $Q$, whereas the latest means and mixture coefficients we&amp;rsquo;ve estimated so far (i.e., those represented by $\theta^{(t)}$) are used to compute the $\gamma(z_{ij})$.&lt;/p>
&lt;h2 id="the-m-step">The M-step&lt;/h2>
&lt;p>Since we&amp;rsquo;ve fixed the covariances (i.e., they are not parameters that need to be found), the M-step updates the $\mu_j$ to be the weighted average of all of the $x_i$. Each point&amp;rsquo;s contribution to the computation of $\mu_j$ is weighted by the corresponding $\gamma(z_{ij})$, or how strongly it is attracted to $\mu_j$. This makes intuitive sense; the farther away a point is from a given mean, the (exponentially) less impact it has on that mean&amp;rsquo;s update.&lt;/p>
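&lt;p>As a sketch (again assuming numpy; the mixture-weight update is included for completeness, though it plays no role once we pass to the k-means limit):&lt;/p>

```python
import numpy as np

def m_step(X, gamma):
    """M-step: each mean is the responsibility-weighted average of the data.

    X: (n, d) data, gamma: (n, k) responsibilities from the E-step.
    Returns updated means (k, d) and mixture weights (k,).
    """
    Nk = gamma.sum(axis=0)             # "effective" number of points per cluster
    mus = (gamma.T @ X) / Nk[:, None]  # gamma-weighted averages of the x_i
    pis = Nk / X.shape[0]              # updated mixture weights
    return mus, pis
```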
&lt;h2 id="putting-it-all-together">Putting it all together&lt;/h2>
&lt;p>The final ingredient we need to make the rigorous connection to k-means is to note what happens when $\sigma$ gets smaller and smaller. As the $k$ Gaussians in our mixture become narrower and narrower, points that are not very close to the means have their likelihoods decay exponentially to 0. Since the sum in the denominator of the expression for $\gamma(z_{ij})$ becomes dominated by the term corresponding to the mean closest to $x_i$, as $\sigma \to 0$, $\gamma(z_{ij})$ tends to 1 when $j$ minimizes $\|x_i - \mu_j\|^2$ and to 0 otherwise. This &amp;ldquo;one-hot&amp;rdquo;ing of the $\gamma(z_i)$ is the same as saying that letting $\sigma$ go to 0 turns our soft assignment into a hard assignment, which is exactly what k-means is designed to do.&lt;/p>
&lt;p>To summarize: The k-means algorithm can be seen as the application of the EM algorithm to a mixture of Gaussians when we assume that the covariance matrices are fixed to $\sigma I$ and we let $\sigma$ go to 0.&lt;/p>
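&lt;p>We can watch this limit happen numerically. The toy sketch below (my own code, with equal mixture weights assumed) computes a single point&amp;rsquo;s responsibilities for two means and shows them collapsing toward one-hot as $\sigma$ shrinks:&lt;/p>

```python
import numpy as np

def gamma_point(x, mus, sigma):
    """Responsibilities for a single point (equal mixture weights assumed)."""
    sq = ((mus - x) ** 2).sum(axis=1)
    log_w = -sq / (2 * sigma)
    w = np.exp(log_w - log_w.max())  # stable even for tiny sigma
    return w / w.sum()

mus = np.array([[0.0], [3.0]])
x = np.array([1.0])                # closer to the first mean
soft = gamma_point(x, mus, 1.0)    # a genuinely soft assignment
hard = gamma_point(x, mus, 0.01)   # nearly one-hot: the k-means limit
```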
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>That&amp;rsquo;s all for now. I&amp;rsquo;ve always found it quite satisfying to see how different, seemingly disparate theorems, algorithms, models, etc. can come together in unexpected ways.&lt;/p>
&lt;p>In the next post, we&amp;rsquo;ll start to develop our Bayesian perspective to develop the new clustering algorithm I mentioned in the intro. Stay tuned!&lt;/p></description></item><item><title>RANSAC for robust data fitting</title><link>https://www.jgindi.me/posts/2023-12-22-ransac/</link><pubDate>Fri, 22 Dec 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-12-22-ransac/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I want to introduce a non-standard way of fitting a mathematical model to data
that I came across during a course I took this past semester in computer vision. While gradient
descent rules the day (as it should!), the method we discuss here is actually pretty clever, and
its simplicity belies a significant benefit: robustness to outliers.&lt;/p>
&lt;h2 id="linear-models">Linear models&lt;/h2>
&lt;h3 id="what-are-they">What are they?&lt;/h3>
&lt;p>Much of machine learning and statistical modeling can be roughly characterized by the following sequence of steps:&lt;/p>
&lt;ol>
&lt;li>Gather data about some phenomenon or process, often in the form of (input, output) pairs.
In this setup, we hope that there is some meaningful relationship between the inputs and the outputs,
and the outputs are the phenomenon we want to learn to predict.&lt;/li>
&lt;li>Make some assumptions and try to find a useful mathematical model that explains the input/output relationship.&lt;/li>
&lt;li>Use the learned mathematical model to make predictions on new examples that were not in the original
training set of inputs and outputs.&lt;/li>
&lt;/ol>
&lt;p>(Mathematically, the model is a function $\hat f$ that maps inputs to outputs, hopefully
without too much error. I call the function $\hat f$ here because one way of thinking about
what we&amp;rsquo;re doing is that we&amp;rsquo;re trying to approximate some true function $f$ that relates the
inputs to the outputs.)&lt;/p>
&lt;p>There is a lot of nuance and subtlety to how we learn the model from data and how we verify
that it&amp;rsquo;s working well on unseen examples, but that&amp;rsquo;s the basic idea.&lt;/p>
&lt;p>One choice that we the modelers have to make before learning the model is the set of &amp;ldquo;shapes&amp;rdquo; it can take,
which encodes an assumption about the underlying input/output relationship. The simplest
possible model we can use is called a linear model, which assumes that the output
changes linearly (usually with some small amount of random variation) as the input changes.&lt;/p>
&lt;h3 id="how-do-we-learn-them">How do we learn them?&lt;/h3>
&lt;p>To make things more mathematically precise, let&amp;rsquo;s say our inputs $x_i$ are $d$-dimensional
vectors of real numbers ($x_i \in \mathbf{R}^d$), and our corresponding outputs
$y_i$ are real numbers ($y_i \in \mathbf{R}$). Using a linear model to model the
relationship is equivalent to making the assumption that there is a vector of
parameters $\theta \in \mathbf{R}^d$ (the slope) and an intercept $b \in \mathbf{R}$
such that
&lt;/p>
$$
\begin{align*}
y_i = \theta^\top x_i + b + \varepsilon_i
\end{align*}
$$&lt;p>
where $\varepsilon_i$ is some randomness that our model doesn&amp;rsquo;t capture.&lt;/p>
&lt;p>The mathematical characterization of the problem of finding the optimal parameters $\theta^\star$
and $b^\star$ is
&lt;/p>
$$
\begin{align}
\theta^\star, b^\star := \underset{\theta, b}{\text{arg min}} ~\| \theta^\top X + b\mathbf{1} - y \|_2^2
\end{align}
$$&lt;p>Before continuing, let&amp;rsquo;s break down those symbols:&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;$\theta^\star, b^\star :=$&amp;rdquo;: We typically use the $\star$ superscript to indicate the best or optimal
choice. We use the $:=$ symbol to indicate that we are not writing out any mathematical equivalence, just a definition.
We&amp;rsquo;re saying, &amp;ldquo;The optimal values of $\theta$ and $b$ are&amp;hellip;&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;$\text{arg min}_{\theta, b}$&amp;rdquo;: If we had written something like $\min_x f(x)$, the &amp;ldquo;value&amp;rdquo; of the expression
would be computed by plugging all possible $x$s into the function $f$ and returning the minimum value of $f(x)$.
For example, if $f(x) = x^2$ and possible values of $x$ were $\{-1, 2, 0.1\}$, we would get
$\min_x f(x) = f(0.1) = 0.01$. By changing the $\min$ to $\text{arg min}$, we instead return the value of $x$
&amp;ndash; rather than the value of $f(x)$ &amp;ndash; that minimizes the value of $f(x)$. Thus, &amp;ldquo;$\underset{\theta, b}{\text{arg min}}$&amp;rdquo;
means that we are looking for the values of $\theta$ and $b$ that minimize the rightmost term&amp;hellip;&lt;/li>
&lt;li>&amp;ldquo;$|| \theta^\top X + b\mathbf{1} - y ||_2^2$&amp;rdquo;: Without getting into detail here, this takes in our parameters $\theta$
and $b$, our data $X$, and our outputs $y$, and outputs a number that measures how well our parameters match inputs
to outputs. (Lower is better.)&lt;/li>
&lt;/ul>
&lt;p>While we&amp;rsquo;re here, as is often the case with these types of formulations, we don&amp;rsquo;t get any information about how to
write a computer program to actually obtain the parameters we seek. We have only addressed the problem
&lt;em>setup&lt;/em>. In fact, many such problem setups do not admit helpful algorithms even if they can be expressed
simply.&lt;/p>
&lt;p>Luckily for us, this particular problem can be solved very efficiently, and we most commonly use one of the following
two methods:&lt;/p>
&lt;ol>
&lt;li>Solve some equations using calculus and linear algebra (because in this case we can).&lt;/li>
&lt;li>Use an algorithm called gradient descent (for cases when we can&amp;rsquo;t). This algorithm
is the workhorse of almost all modern machine learning.&lt;/li>
&lt;/ol>
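&lt;p>For the curious, method 1 amounts to a few lines of numpy. This is only a sketch (the function name is mine), with the intercept handled by appending a column of ones to the inputs:&lt;/p>

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares (method 1 above) via numpy's lstsq.

    X: (n, d) inputs, y: (n,) outputs. Returns (theta, b) minimizing the
    squared error; the intercept is absorbed by a column of ones.
    """
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]
```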
&lt;p>Neither of those is the topic of this post, but there are lots of nice articles
out there if you are interested in learning more. For the rest of &lt;em>this&lt;/em> post, I want
to describe another less widely known algorithm for finding those parameters.&lt;/p>
&lt;h2 id="ransac">RANSAC&lt;/h2>
&lt;p>One deficiency of the approaches mentioned above is that without certain auxiliary techniques,
both of those methods are sensitive to outliers, as shown in the simple figure below (&lt;a href="https://cs.nyu.edu/~fergus/teaching/vision/12_descriptors_matching.pdf">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/ransac/ls.png" alt="drawing" width="500"/>
&lt;p>Our intuition tells us that something is off here. The line in the image seems to miss that the true relationship,
obscured by the outliers, is roughly along the diagonal from the bottom left to the top right of the plot.&lt;/p>
&lt;p>One simple and interesting way of finding $\theta$
that is robust to outliers is called &lt;strong>RAN&lt;/strong>dom &lt;strong>SA&lt;/strong>mpling &lt;strong>C&lt;/strong>onsensus (RANSAC).
The way it works is as follows:&lt;/p>
&lt;ol>
&lt;li>Randomly select a subset of $d$ input/output $(x, y)$ pairs. Stack the $x_i$s into a matrix $\tilde X$
($x_i$ is the $i$th column) and the $y_i$s into a vector $\tilde y$.&lt;/li>
&lt;li>Solve $\theta^\top \tilde X = \tilde y$ for $\theta$ (provided certain conditions are met, the equation
has a unique solution).&lt;/li>
&lt;li>Across the entire original dataset $X$, count inliers, or the number of $(x, y)$ pairs such that
$\text{dist}(\theta^\top x, y)$ is small (the modeler chooses a threshold and an application-appropriate
distance function $\text{dist}$).&lt;/li>
&lt;/ol>
&lt;p>In order to increase our chances of discovering a favorable parameter combination, we can repeat this
process until the inlier count (perhaps as a fraction of the overall dataset size) is sufficiently high.
The more times we are willing to repeat steps 1-3, the better a model we will find.&lt;/p>
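&lt;p>A minimal sketch of those steps for the linear model (my own illustrative code, assuming numpy and a model without an intercept, matching the equation $\theta^\top \tilde X = \tilde y$ above):&lt;/p>

```python
import numpy as np

def ransac_linear(X, y, threshold, n_iters=200, seed=None):
    """A minimal RANSAC sketch for a no-intercept linear model.

    X: (n, d) inputs, y: (n,) outputs. Returns the theta (fit on a random
    size-d subset, per steps 1-2) with the most inliers (step 3).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_theta, best_inliers = None, -1
    for _ in range(n_iters):
        idx = rng.choice(n, size=d, replace=False)   # step 1: random subset
        try:
            theta = np.linalg.solve(X[idx], y[idx])  # step 2: exact solve
        except np.linalg.LinAlgError:
            continue                                 # degenerate sample; retry
        residuals = np.abs(X @ theta - y)            # step 3: count inliers
        inliers = int((residuals < threshold).sum())
        if inliers > best_inliers:
            best_theta, best_inliers = theta, inliers
    return best_theta
```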
&lt;p>The beauty of this algorithm is that the parameter combination we end up with will have been computed from
the most &amp;ldquo;normal&amp;rdquo; set of examples we encounter; in other words, the algorithm tends to avoid
outliers that might adversely influence the model fit. To visually register how RANSAC is able to ignore outliers,
the image below shows the much more reasonable line it would find with a sensible choice of hyperparameters
(&lt;a href="https://en.wikipedia.org/wiki/Random_sample_consensus#:~:text=Random%20sample%20consensus%20(RANSAC)%20is,the%20values%20of%20the%20estimates.">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/ransac/ransac.png" alt="drawing" width="300"/>
&lt;p>While RANSAC is certainly elegant, there are downsides too. As we might expect, RANSAC tends to perform
worse as the dataset becomes
more and more polluted by outliers, though there are modifications to what we just described that increase the
algorithm&amp;rsquo;s outlier tolerance. The primary disadvantage of RANSAC is that there is no guarantee
that the algorithm will converge, which is a fancy way of saying that since our subset selections are random,
we can&amp;rsquo;t be mathematically sure that we will eventually zero in on the optimal model parameters given our data.&lt;/p>
&lt;h2 id="application-to-old-school-computer-vision">Application to (old-school) computer vision&lt;/h2>
&lt;p>I came across RANSAC in a computer vision course at NYU taught by Robert Fergus. Near the end of the course,
after reveling in the various and wondrous ways that neural networks have upended and redefined how computers process
and, more recently, create, visual information, we had a final unit on techniques in computer vision that
predated the deep learning revolution.&lt;/p>
&lt;p>One problem from that unit that one might be interested in solving is to find some kind of correspondence between
two images. For example, given the two images of the same scene on the left, we might want to produce
a mosaic image like the one on the right (&lt;a href="https://cs.nyu.edu/~fergus/teaching/vision/12_descriptors_matching.pdf">source&lt;/a>):
&lt;img src="https://www.jgindi.me/posts/ransac/mosaic.png" alt="image info">&lt;/p>
&lt;p>To highlight where RANSAC comes into play, let&amp;rsquo;s suppose that we&amp;rsquo;ve waved our magic wand and
(1) identified &amp;ldquo;key points&amp;rdquo; in both of the individual images, and (2) determined
a set of correspondences between the key points in the top and bottom images.&lt;/p>
&lt;p>RANSAC can help figure out the transformation that &amp;ldquo;happened&amp;rdquo; to cause the key points in one image to turn
into their (hopefully correct) corresponding points in the other. It turns out that this transformation
has a small number of parameters (six, in this case), and we can actually find it using RANSAC as follows:&lt;/p>
&lt;ol>
&lt;li>Randomly select a subset of matching points from both images. (The number of matching points is determined by the
number of parameters to estimate, in this case, 6.)&lt;/li>
&lt;li>Find the parameters of a transformation $T$ that would turn points in one image into their matching points in the other.&lt;/li>
&lt;li>Across all matching pairs of points, denoted $(k_1, k_2)$, count the number of inliers, i.e., pairs for which $T(k_1)
isn&amp;rsquo;t far from $k_2$.&lt;/li>
&lt;/ol>
&lt;p>After finding the transformation that corresponds to the largest number of inliers, we can use this transformation
to carry out the remaining steps required to combine the images.&lt;/p>
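&lt;p>To make step 2 concrete, here is a sketch (my own code, assuming the transformation is affine, which accounts for the six parameters) that recovers the transformation exactly from 3 correspondences:&lt;/p>

```python
import numpy as np

def fit_affine(src, dst):
    """Solve for the 6 affine parameters mapping src points to dst points.

    src, dst: (3, 2) arrays of matching 2D key points. The map is
    dst = A @ src + t, with A a 2x2 matrix and t a 2-vector (6 unknowns),
    so 3 correspondences give exactly 6 linear equations.
    """
    M, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        M.append([x, y, 0, 0, 1, 0]); b.append(u)  # u = a11 x + a12 y + t1
        M.append([0, 0, x, y, 0, 1]); b.append(v)  # v = a21 x + a22 y + t2
    p = np.linalg.solve(np.array(M, float), np.array(b, float))
    return p[:4].reshape(2, 2), p[4:]
```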
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post, we learned about RANSAC, an algorithm that finds the parameters of a model that can handle the presence
of otherwise corruptive outliers. Algorithms that operate by trying a bunch of (literally) random options are generally
the stuff of novices, so it&amp;rsquo;s cool when that very simplicity turns out to solve an important technical problem.&lt;/p></description></item><item><title>Faster language model inference</title><link>https://www.jgindi.me/posts/2023-04-06-fast-lm-inf/</link><pubDate>Thu, 06 Apr 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-04-06-fast-lm-inf/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Over the past few years, large language models (LLMs) &amp;ndash; most recently &lt;a href="https://chat.openai.com/auth/login">ChatGPT&lt;/a> &amp;ndash; have
received lots of (well-deserved) press. Though they have their shortcomings, they are able to compose shockingly
cogent prose, and the quality appears to increase the bigger the models themselves become.&lt;/p>
&lt;p>One aspect that I think is often overlooked by much of the public is the (computational, which implies financial) cost
of actually turning the model&amp;rsquo;s crank to produce text. Soon after ChatGPT was released, it was estimated that at 1 million users
it was costing OpenAI around 3 million dollars per month in cloud compute costs. With 100 million users (assuming nothing else
has changed), this would cost 300 million dollars per month, or 3.6 billion dollars per year!&lt;/p>
&lt;p>One of the themes we&amp;rsquo;ve observed over and over in technology over the past few decades is that when a disruptive, but costly,
technological advance emerges, the cost of producing &amp;ndash; or in this case, serving &amp;ndash; that technology typically declines
in response to increased motivation to profitably unleash the technology&amp;rsquo;s capabilities.&lt;/p>
&lt;p>In this post, I want to describe a relatively simple and intuitive technique from &lt;a href="https://arxiv.org/pdf/2302.01318.pdf">a paper by DeepMind&lt;/a>
that might be a first step in the direction of bringing down the cost of operating the types of large language models that I believe will become
ubiquitous in the years and decades to come.&lt;/p>
&lt;p>(As a note, below I will use the word &amp;ldquo;expensive&amp;rdquo; a lot. This word refers to computational expense, but computational expense
is directly relatable to financial expense, so you can think of it that way too if you&amp;rsquo;d like.)&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>At a very high level, many large language models (such as GPT and friends) generally produce text using &lt;strong>autoregressive sampling&lt;/strong>, which is
a fancy term for using previously generated (hence autoregressive) text to produce a probability
distribution over possible next words and then drawing from it (hence sampling). To understand what this means, suppose your vocabulary
has three words: &amp;ldquo;apple,&amp;rdquo; &amp;ldquo;banana,&amp;rdquo; and &amp;ldquo;carrot.&amp;rdquo; A distribution over these three words in essence assigns
a probability to obtaining each word (the probabilities have to add up to 1) if you were to sample randomly
from them. (There are infinitely many possible distributions you could choose. Usually, assumptions about
distributions have to be reasonable, not correct.)&lt;/p>
&lt;p>The distribution over your vocabulary words is determined by the model at each step. In some sense, the model produces a
distribution that makes sense given the text you&amp;rsquo;ve already generated (and/or a prompt that you wrote). If you had already
generated the partial sentence &amp;ldquo;I will wear a raincoat because it is going to,&amp;rdquo; a good model would
produce a distribution over your vocabulary that indicates a very high probability on a word like &amp;ldquo;rain,&amp;rdquo;
and a very low probability on a word like &amp;ldquo;spinach&amp;rdquo; (which contextually doesn&amp;rsquo;t fit).&lt;/p>
&lt;p>The takeaway from this can be summarized as follows:&lt;/p>
&lt;ol>
&lt;li>These models are very large, so each step (word) is expensive to compute.&lt;/li>
&lt;li>Because the samples have to be produced sequentially, they require
many LLM steps.&lt;/li>
&lt;/ol>
&lt;p>Put simply, many steps x high cost = very high cost!&lt;/p>
&lt;h1 id="the-new-idea">The New Idea&lt;/h1>
&lt;p>DeepMind attempts to tackle (2) from the previous section by reducing the number of inference steps
required of the very large (and very expensive) model while maintaining the high quality of the tokens;
it sounds like free lunch! How do they do this?&lt;/p>
&lt;p>The basic idea is that given some number of previous words, we:&lt;/p>
&lt;ol>
&lt;li>Use a smaller, less expensive model to generate a candidate sequence of a certain length.&lt;/li>
&lt;li>Use the big model to score the words generated by the small model. (The scores here can be
thought of as measures of approval of the draft tokens by the big model.)&lt;/li>
&lt;li>Use the word scores to decide how much of the sequence to use.&lt;/li>
&lt;li>Rinse and repeat until the sequence is of the desired length.&lt;/li>
&lt;/ol>
&lt;p>If you stop reading here, you&amp;rsquo;ve learned the important idea. Lately, I&amp;rsquo;ve been trying to keep posts very
high level, but in this case, given that the importance-to-complexity ratio of this idea is very high,
I&amp;rsquo;m going to break pattern and go into some more technical detail in the sections below.&lt;/p>
&lt;h1 id="speculative-sampling">Speculative Sampling&lt;/h1>
&lt;p>We will now discuss the algorithm in more detail. The below steps are carried out until the sequence is of the
desired length.&lt;/p>
&lt;h2 id="step-1-generate-and-score-a-draft">Step 1: Generate and Score a Draft&lt;/h2>
&lt;p>The draft is generated using a smaller, less expensive model (the draft model), and the scoring &amp;ndash;
which requires appealing to the large model &amp;ndash; can happen in parallel. In this context, scoring a token
means computing the probability of that token occurring given the current sequence and the already generated
draft tokens. Draft generation must be sequential, but because the large model scores all the draft tokens in
a single parallel pass, this step is the source of the algorithm&amp;rsquo;s speed-up.&lt;/p>
&lt;h2 id="step-2-deciding-how-much-of-the-sequence-to-accept">Step 2: Deciding How Much of the Sequence to Accept&lt;/h2>
&lt;p>Next, the algorithm requires figuring out how much of the sequence to accept.
This takes the form of accepting each successive token produced by the draft model
with some probability (that depends on the prior tokens that have already been accepted).
Once we decide not to accept a token, we sample a token from some distribution (we will think through
a good one to use below) and start fresh with a new draft.
In my opinion, choosing the right probability and the right alternative distribution
is where the algorithm&amp;rsquo;s cleverness is really on display. To be specific about what we aim to
disambiguate in this section, there are two questions we seek to answer:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>What probability $r$ should we use to decide whether or not to accept the next token?&lt;/strong>&lt;/li>
&lt;li>&lt;strong>What alternative distribution should we use if the $i$th token of the draft is not accepted?&lt;/strong>&lt;/li>
&lt;/ol>
&lt;p>Before addressing these questions, we should state this algorithm&amp;rsquo;s overall objective a little bit more
precisely:
&lt;strong>We want the speculatively sampled sequence to come from the same distribution as the sequence we would&lt;/strong>
&lt;strong>get if we autoregressively sampled a sequence from the large expensive model.&lt;/strong>&lt;/p>
&lt;p>Another item to clarify is that we can, in some sense, discuss models and distributions interchangeably.
Here, a model is a way of taking a sequence and producing probabilities that each token in the vocabulary is the
next one in the sequence. We can thus refer to and talk about models like we refer to distributions. To this end,
let $q(x \mid x_1, x_2, \dots, x_i)$ be the expensive model, and $p(x \mid x_1, x_2, \dots, x_i)$ be
the draft model. As in the paper, we will also refer to the $t$th draft token as $\tilde x_t$.&lt;/p>
&lt;p>I will first state the answers to questions (1) and (2), and then I will show that they work. The probability $r$
that we use is given by the expression
&lt;/p>
$$
\begin{align*}
r = \min\biggl(1, \frac{q(\tilde x_t \mid x_1, \dots, x_{n+t-1})}{p(\tilde x_t \mid x_1, \dots, x_{n+t-1})} \biggr)
\end{align*}
$$&lt;p>
and the rejection distribution is
&lt;/p>
$$
\begin{align}
(q(x \mid x_1, \dots, x_{n+t-1}) - p(x \mid x_1, \dots, x_{n+t-1}))_+
= \frac{\max(0, q(x \mid x_1, \dots, x_{n+t-1}) - p(x \mid x_1, \dots, x_{n+t-1}))}
{\sum_{x'} \max(0, q(x' \mid x_1, \dots, x_{n+t-1}) - p(x' \mid x_1, \dots, x_{n+t-1}))}
\end{align}
$$&lt;p>These probabilities and distributions depend on the combined initial sequence and already accepted draft tokens.
Also, note that if $q(\tilde x_t \mid x_1, \dots, x_{n+t-1}) \geq p(\tilde x_t \mid x_1, \dots, x_{n+t-1})$, we automatically accept
the $t$th draft token. Intuitively, this checks out, because if the draft model produces a token which, given
the prior sequence tokens, is more likely to have been produced by the large model than the draft model, of course
we should use it!&lt;/p>
&lt;p>If instead we have $q(\tilde x_t \mid x_1, \dots, x_{n+t-1}) &lt; p(\tilde x_t \mid x_1, \dots, x_{n+t-1})$,
then we accept $\tilde x_t$ with a probability that is larger when $q$&amp;rsquo;s score is close to $p$&amp;rsquo;s.
If $p$ gives $\tilde x_t$ a score of 0.36 and $q$ gives it a score of 0.12, for example, then we will accept
$\tilde x_t$ with probability 0.12 / 0.36 = 1/3. Alternatively, if $p$ gives a score of 0.36
and $q$ gives a score of 0.0001, we will likely not accept the token because the target and draft
models really disagree about whether $\tilde x_t$ would make a good next token.&lt;/p>
&lt;p>(Recall from earlier that we computed the $q$ scores for the draft tokens in parallel during Step 1.)&lt;/p>
&lt;p>If we accept $\tilde x_t$, then we take another step and consider whether or not to accept $\tilde x_{t+1}$.
If, on the other hand, we reject $\tilde x_t$, we sample from the complicated looking distribution
we spelled out in equation (1). What we want is a distribution that re-weights the possible tokens
to sample from the large expensive model in a sensible way. In our case, $q - p$ (in the numerator)
will produce negative &amp;ldquo;probabilities&amp;rdquo; for tokens where $q &lt; p$, and positive values where $q > p$. This kind of makes
sense, because if $\tilde x_t$ is rejected, we want to favor sampling tokens to which $q$ assigns higher scores than $p$ does.
We have two problems, though:&lt;/p>
&lt;ol>
&lt;li>Probabilities have to be nonnegative&lt;/li>
&lt;li>Probabilities have to sum to 1&lt;/li>
&lt;/ol>
&lt;p>But these are actually no problem at all! To solve (1), we modify $q-p$ to $\max(0, q-p)$, and to solve (2),
we use the standard normalization trick: dividing by the sum!
(For example, if I wanted to make the list [1, 2, 3, 4, 5] into a probability
distribution, I would divide each element by the sum to obtain [1/15, 2/15, 3/15, 4/15, 5/15].) Once we
make both of those modifications, we obtain equation (1), which we shorten to the expression on the left
side of the equals sign.&lt;/p>
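&lt;p>Both ingredients are short enough to write down directly. The sketch below (the function names are mine, not from the paper) assumes $q$ and $p$ are arrays of next-token probabilities over the same vocabulary:&lt;/p>

```python
import numpy as np

def accept_prob(q_tok, p_tok):
    """r = min(1, q/p): the probability of accepting a draft token."""
    return min(1.0, q_tok / p_tok)

def residual_distribution(q, p):
    """The rejection distribution (q - p)_+ from equation (1):
    clip q - p at zero, then renormalize by dividing by the sum."""
    w = np.maximum(0.0, q - p)
    return w / w.sum()
```

For example, with target probabilities $q = [0.5, 0.3, 0.2]$ and draft probabilities $p = [0.2, 0.5, 0.3]$, the residual distribution puts all of its mass on the first token, the only one where $q$ exceeds $p$.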
&lt;p>So far, we have resorted to intuition to motivate our choices, but it turns out that the two choices
fit together like elegant mathematical puzzle pieces. How this happens goes back to our
objective, which is to come up with a sample that looks as though it was obtained strictly using the
target (expensive) model. Let&amp;rsquo;s see if we can show that our strategy helps accomplish this.&lt;/p>
&lt;h3 id="proving-that-we-recover-the-target-distribution">Proving that we recover the target distribution&lt;/h3>
&lt;p>If we have two discrete distributions $a$ (target) and $b$ (draft) and a draft sample
$x' \sim b$ ($\sim$ means &amp;ldquo;sampled from&amp;rdquo;), let $X$ be a random variable representing the sample
produced by speculative sampling. If $X$ ends up taking on a specific value $x$, there are two possible
ways it could have happened:&lt;/p>
&lt;ol>
&lt;li>We accepted $x'$, in which case $x' = x$.&lt;/li>
&lt;li>We rejected $x'$, in which case $x \sim (a - b)_+$.&lt;/li>
&lt;/ol>
&lt;h3 id="outcome-1">Outcome 1&lt;/h3>
&lt;p>The probability of outcome (1) is the probability that the draft sample $x'$ is accepted given that it takes the particular value $x$.
We have to multiply this by the probability that the draft distribution assigns to the event that $x'$ takes on that value, so we have
&lt;/p>
$$
\begin{align*}
P(\text{outcome 1}) = P(x'~\text{accepted} \mid x' = x)P(x' = x)
\end{align*}
$$&lt;p>The probability of sampling the value of $x$ from the distribution $b$ is simply $b(x)$. The probability of
accepting it is $\min(1, a(x) / b(x))$ (from the algorithm specification). We thus have
&lt;/p>
$$
\begin{align*}
P(\text{outcome 1}) = b(x) \min(1, a(x)/b(x)) = \min(b(x), a(x)).
\end{align*}
$$&lt;h3 id="outcome-2">Outcome 2&lt;/h3>
&lt;p>On the other hand, if $x'$ is rejected, then the probability that $X$ takes the value
$x$ is the probability of sampling $x$ from $(a(x) - b(x))_+$. By our definition of that
distribution, we would have
&lt;/p>
$$
\begin{align*}
P(X = x \mid x'~\text{rejected}) = \frac{\max(0, a(x) - b(x))}{\sum_{\hat x} \max(0, a(\hat x) - b(\hat x))}.
\end{align*}
$$&lt;p>
We need to weight this outcome by the probability that $x'$ is rejected, which is given by
&lt;/p>
$$
\begin{align*}
P(x'~\text{rejected}) &amp;= 1 - P(x'~\text{accepted}) \\
&amp;= 1 - \sum_{\hat x} P(X = \hat x, x'~\text{accepted}) \\
&amp;= 1 - \sum_{\hat x} \min(a(\hat x), b(\hat x)) \\
&amp;= \sum_{\hat x} a(\hat x) - \min(a(\hat x), b(\hat x)) \\
&amp;= \sum_{\hat x} \max(0, a(\hat x) - b(\hat x)).
\end{align*}
$$&lt;p>
In the above sequence of equalities, the second uses marginalization: a marginal probability can be
recovered by summing the joint probability over all values of the other variable. The third equality reuses our computation from Outcome 1.
The fourth uses the fact that the 1 outside the summation can be broken into probabilities $a(\hat x)$
for all possible values of $\hat x$, since probabilities must sum to 1. Finally, the last equality follows
when you flip $-\min(a(\hat x), b(\hat x))$ to $\max(-a(\hat x), -b(\hat x))$ and then add the $a(\hat x)$
to both of the arguments to $\max$.&lt;/p>
&lt;p>Now that we&amp;rsquo;ve worked all of the details out, does the last expression look familiar? It is the denominator of
$P(X = x \mid x'~\text{rejected})$! Multiplying our two probabilities together, we have
&lt;/p>
$$
\begin{align*}
P(\text{outcome 2}) &amp;= P(X = x \mid x'~\text{rejected}) P(x'~\text{rejected}) \\
&amp;= \frac{\max(0, a(x) - b(x))}{\sum_{\hat x} \max(0, a(\hat x) - b(\hat x))} \sum_{\hat x} \max(0, a(\hat x) - b(\hat x)) \\
&amp;= \max(0, a(x) - b(x))
\end{align*}
$$&lt;h3 id="putting-them-together">Putting Them Together&lt;/h3>
&lt;p>Now that we&amp;rsquo;ve computed probabilities for both options, we note that the two possibilities are mutually
exclusive and exhaustive ways that $X$ can take the value $x$. Thus, the probability $P(X = x)$ is given by
&lt;/p>
$$
\begin{align*}
P(X = x) &amp;= P(\text{outcome 1}) + P(\text{outcome 2}) \\
&amp;= \min(b(x), a(x)) + \max(0, a(x) - b(x)).
\end{align*}
$$&lt;p>
Now, if $a(x) > b(x)$, then the first term is $b(x)$ and the second term is $a(x) - b(x)$. Adding these together,
we get $a(x)$. If $a(x) \leq b(x)$, then the first term is $a(x)$ and the second term is 0, so again the sum
is $a(x)$. Thus, speculative sampling recovers the target distribution $a(x)$. In other words, the rejection
sampling technique we&amp;rsquo;ve devised produces sequences of tokens that are theoretically indistinguishable
from the very expensive target model!&lt;/p>
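&lt;p>To make the guarantee concrete, here is a minimal simulation of the accept/reject scheme (an illustrative sketch with made-up target and draft distributions $a$ and $b$ over a three-token vocabulary, not the paper&amp;rsquo;s code). The empirical frequencies of the sampled tokens match $a$, just as the derivation promises.&lt;/p>

```python
import random
from collections import Counter

random.seed(0)

# Made-up target (a) and draft (b) distributions over a tiny vocabulary.
a = {"x": 0.5, "y": 0.3, "z": 0.2}  # expensive target model
b = {"x": 0.2, "y": 0.5, "z": 0.3}  # cheap draft model

def speculative_sample():
    # Draw a candidate token from the cheap draft distribution.
    candidate = random.choices(list(b), weights=list(b.values()))[0]
    # Accept it with probability min(1, a(x') / b(x')).
    if min(1.0, a[candidate] / b[candidate]) > random.random():
        return candidate
    # On rejection, resample from the residual max(0, a - b), normalized.
    residual = {t: max(0.0, a[t] - b[t]) for t in a}
    total = sum(residual.values())
    return random.choices(list(residual),
                          weights=[w / total for w in residual.values()])[0]

counts = Counter(speculative_sample() for _ in range(100_000))
for token in a:
    print(token, counts[token] / 100_000)  # each frequency lands near a(token)
```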
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>While this technique is just one step towards making LLMs more efficient, it highlights the potential return on
further innovation in the space of faster LLM inference. As the number of LLM applications continues to explode, we can expect even more creative solutions
to emerge, hopefully making these powerful tools more accessible and affordable for everyone.&lt;/p>
&lt;p>If not speculative sampling itself, then methods in the same spirit will only become more necessary and important as we continue to push the boundaries of size and scale in generative models. I thought this technique was worth illuminating because of its simple yet powerful, theoretically grounded choices. In many deep
learning applications, systems often seem like quasi-magical feats of engineering whose designers don&amp;rsquo;t even always
know why they work as well as they do. In reading DeepMind&amp;rsquo;s speculative sampling paper, I found the technique&amp;rsquo;s
simplicity and mathematical rigor refreshing.&lt;/p></description></item><item><title>How does OpenAI's DALL-E work?</title><link>https://www.jgindi.me/posts/2023-01-03-dalle/</link><pubDate>Tue, 03 Jan 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-01-03-dalle/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Before I say anything else, create an account and take &lt;a href="https://openai.com/dall-e-2/">DALL-E 2&lt;/a> out
for a spin. It is an example of &lt;strong>generative&lt;/strong> AI, which, as a field, has seen important, exciting, and
viral breakthroughs during 2022. Here are some examples from the DALL-E paper:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/dalle/dalle-examples.png" alt="">
&lt;em>Examples of captions and generated images from the DALL-E paper.&lt;/em>&lt;/p>
&lt;p>To understand what generative AI is, let&amp;rsquo;s say you have images that have each been labeled
as either an image that contains a cat, or an image that does not contain a cat. A (probabilistic)
&lt;strong>discriminative&lt;/strong> model is one that tries to extract some kind of meaningful information from images
(such as the presence of certain types of edges or textures) and discriminate between cat and non-cat images. Most
classification models implemented across industry are of this type. A generative model, on the other hand, learns how to generate new (image, cat/no-cat label) pairs. Ian Goodfellow, a pioneer of generative modeling,
&lt;a href="https://www.quora.com/Why-are-generative-models-harder-to-create-than-discriminative-models">very succinctly&lt;/a>
explains why generative modeling is usually much harder to do well:&lt;/p>
&lt;blockquote>
&lt;p>Can you look at a painting and recognize it as being the Mona Lisa? You probably can. That&amp;rsquo;s discriminative modeling.
Can you paint the Mona Lisa yourself? You probably can&amp;rsquo;t. That&amp;rsquo;s generative modeling.&lt;/p>&lt;/blockquote>
&lt;p>The most popular examples of generative AI that have emerged over the past year have been in the fields of computer vision
(&lt;a href="https://stablediffusionweb.com">Stable Diffusion&lt;/a>, DALL-E) and natural language processing (&lt;a href="https://chat.openai.com">Chat GPT&lt;/a>),
and in this post, I want to try to provide a very high-level overview of how DALL-E learns to generate images from
text prompts.&lt;/p>
&lt;h2 id="what-we-will-not-discuss-but-should-at-least-mention">What we will not discuss (but should at least mention)&lt;/h2>
&lt;p>The &amp;ldquo;machine learning&amp;rdquo; that powers many of these large, impressive models is very often not the hardest part of
developing them. There are computational considerations and optimizations that are critically
important to making such models actually work, but I don&amp;rsquo;t think describing them here will add a lot of value for my readers,
so we won&amp;rsquo;t be going into those details. (In fact, in the DALL-E paper, they explicitly acknowledge that &amp;ldquo;[g]etting the model to train
in 16-bit precision past one billion parameters, without diverging, was the most challenging part of this project.&amp;rdquo;)&lt;/p>
&lt;p>Another important thing to remember is that the amazing capabilities of these models emerge from mountains of high-quality data and computing power. Even if the model code is made available to the public, training and using the models without sufficient ability to handle huge quantities of data and throw massive amounts of compute at the problem will result in severely limited models. There are some efforts to create smaller versions of these models that can be run and used by individuals, but so far they don&amp;rsquo;t seem as capable, in general, as their huge progenitors.&lt;/p>
&lt;h2 id="disclaimer">Disclaimer&lt;/h2>
&lt;p>This post is my high-level overview/summary of part of the &lt;a href="https://arxiv.org/pdf/2102.12092.pdf">DALL-E paper&lt;/a>. It is possible
that I misunderstood something or explained it incorrectly. If you come across any such mistakes, let me know so that I
can learn from and correct them.&lt;/p>
&lt;p>With that out of the way, let&amp;rsquo;s have a look under the hood.&lt;/p>
&lt;h1 id="high-level-overview">High level overview&lt;/h1>
&lt;p>DALL-E learns using about 250 million (image, text) pairs. The learning process is roughly:&lt;/p>
&lt;ol>
&lt;li>Learn how to come up with useful (and compressed) representations of images&lt;/li>
&lt;li>Turn each prompt into a sequence of tokens&lt;/li>
&lt;li>Combine the representations of the images and the corresponding prompts&lt;/li>
&lt;li>Train a neural network to predict the next token of the combined representation given
some of the previous tokens.&lt;/li>
&lt;/ol>
&lt;p>We will discuss each of these steps in turn.&lt;/p>
&lt;h1 id="learning-image-representations">Learning image representations&lt;/h1>
&lt;p>One of the most important, foundational ideas in deep learning is:
neural networks like numbers (not images or text). In order to bring the full power of
neural networks to bear on text and vision problems, we usually use approaches that
are variations on this theme:&lt;/p>
&lt;ol>
&lt;li>Convert the image (or text) into a numerical representation such that the meaning/content of the image
(or text) is preserved. (For example, in representation space, collections of numbers corresponding to images
of cats should be closer to one another than they are to collections of numbers corresponding to
images of galaxies.) We will refer to these as representations or embeddings.&lt;/li>
&lt;li>Use the representations for some task that we care about (e.g. discriminate between cat and non-cat images).&lt;/li>
&lt;/ol>
&lt;p>There are many approaches that accomplish step 1. In the case of DALL-E, OpenAI used an
&lt;strong>autoencoder&lt;/strong>. This architecture consists of two parts:&lt;/p>
&lt;ol>
&lt;li>Encoder: takes an image and produces a representation&lt;/li>
&lt;li>Decoder: takes the representation and tries to reproduce the image&lt;/li>
&lt;/ol>
&lt;p>In order to learn, the encoder and decoder use penalties incurred for differences between the original image and the
decoder&amp;rsquo;s reconstruction to update their internal states. When training is complete, we can use
the encoder to produce useful image representations for whatever other tasks we intend to carry out with the
images as input. Intuitively, we can think about an autoencoder as a kind of compression algorithm. We take an image, compress it
into a representation that is (1) smaller than the original image, but (2) such that we can (mostly) reconstruct
the original image. If we can do this well, then we&amp;rsquo;ve come up with a representation that seems to carry much of the
important information from the original image, which is what we wanted.&lt;/p>
&lt;p>(Really, it uses the generative and more involved cousin of autoencoders called a variational autoencoder, but the high-level idea of
learning a compressed representation is the same.)&lt;/p>
&lt;p>The autoencoder that DALL-E uses compresses 256x256 (dimensions in pixels) images into 32x32 representations. Each 256x256
image has 3 numbers associated with each pixel (red, green, and blue concentrations). Thus, each original image
requires 256 * 256 * 3 = 196,608 numbers to represent it, whereas the representation only requires 32 * 32 = 1024 numbers.
This is a compression factor of 192!&lt;/p>
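&lt;p>The arithmetic above is quick to check:&lt;/p>

```python
# Numbers from the post: DALL-E's autoencoder maps 256x256 RGB images
# to a 32x32 grid of discrete codes.
original = 256 * 256 * 3   # one number per pixel per color channel
compressed = 32 * 32       # one code per grid cell

print(original)                # 196608 numbers per raw image
print(compressed)              # 1024 numbers per representation
print(original // compressed)  # compression factor of 192
```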
&lt;p>For a visual idea of how good these encodings are once the system is trained, here are some examples
of original and reconstructed images:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/dalle/vae-reconstr.png" alt="">
&lt;em>Original (top) and reconstructed (bottom) images produced by the image representation learning system.&lt;/em>&lt;/p>
&lt;p>(Note: Representations of this kind are often continuous, meaning the numbers in each representation slot can be
any real number. In this case, the encoding is discrete, which just means that the numbers in each representation slot
are actually whole numbers instead.)&lt;/p>
&lt;h1 id="encoding-prompts">Encoding prompts&lt;/h1>
&lt;p>To encode the prompts, DALL-E uses byte-pair encoding, which can more broadly be categorized as a &lt;strong>tokenization&lt;/strong> method.
Tokenization is a way of breaking down unstructured natural language into a finite set of meaningful atomic units. It can be carried
out at the level of sentences, words, or even parts of words (e.g., &amp;ldquo;eating&amp;rdquo; might be broken into &amp;ldquo;eat&amp;rdquo; and &amp;ldquo;ing&amp;rdquo;). Each token in
a limited vocabulary (e.g., 10k frequent words or subwords) is usually assigned an identifier &amp;ndash; in this case, a number
(there are usually also numbers reserved for unknown tokens, beginnings and ends of sentences, etc.).
Once we&amp;rsquo;ve decided how we&amp;rsquo;re going to tokenize and have a vocabulary of identifiers, we can then process our
input text and turn it into its corresponding tokenized representation.&lt;/p>
&lt;p>(We can manually decide the level at which to tokenize text (e.g., words, sentences) or we can use machine learning to learn
a good tokenization, but we aren&amp;rsquo;t going to go into detail about those approaches here.)&lt;/p>
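&lt;p>As a toy illustration (word-level rather than the byte-pair encoding DALL-E actually uses, and with an invented vocabulary), tokenization boils down to a mapping from text to identifiers:&lt;/p>

```python
# A toy word-level tokenizer. Real systems like byte-pair encoding work on
# subword units, but the vocabulary-to-identifier mapping is the same idea.
vocab = {"UNK": 0, "BOS": 1, "EOS": 2, "a": 3, "cat": 4, "wearing": 5, "hat": 6}

def tokenize(text):
    # Map each word to its id, falling back to the unknown-token id,
    # and wrap the result in beginning/end-of-sentence markers.
    ids = [vocab.get(word, vocab["UNK"]) for word in text.lower().split()]
    return [vocab["BOS"]] + ids + [vocab["EOS"]]

print(tokenize("a cat wearing a hat"))     # [1, 3, 4, 5, 3, 6, 2]
print(tokenize("a cat wearing a tuxedo"))  # "tuxedo" maps to UNK: [1, 3, 4, 5, 3, 0, 2]
```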
&lt;h1 id="jointly-modeling-text-and-image-tokens">Jointly modeling text and image tokens&lt;/h1>
&lt;p>Once we have computed our image representations and tokenized our text, we glue the representations together into what is
essentially a composite representation for a (text, image) pair. We then train a model called a transformer whose inputs are the entire
stream of concatenated text and image tokens. The model learns to predict the next token in the sequence using earlier tokens.&lt;/p>
&lt;p>Transformers are based on a mechanism called &lt;strong>(self-)attention&lt;/strong>. In our case, this means that
the model learns to give different weight to different previously generated tokens as it attempts to predict
the next one. For example, if we were trying to predict the next word (&amp;ldquo;cold&amp;rdquo;) in the sentence &amp;ldquo;I need a jacket because it is ___&amp;rdquo;,
the model would learn that it should give the word &amp;ldquo;jacket&amp;rdquo; more weight than the word &amp;ldquo;I.&amp;rdquo;&lt;/p>
&lt;p>(DALL-E actually uses multi-head attention, which means it learns many weighting schemes at the same time for a given
set of previously generated tokens and then makes a prediction using a combination of the outputs of all of those schemes.)&lt;/p>
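&lt;p>A single attention step can be sketched with toy numbers (nothing here is DALL-E&amp;rsquo;s actual weights; in a real transformer, the query, key, and value vectors are themselves learned):&lt;/p>

```python
import math

def softmax(scores):
    # Turn raw scores into weights that are positive and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each previous token against the query (scaled dot product),
    # then average the value vectors using the softmax of the scores.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# Toy setup: the key for "jacket" is engineered to align with the query,
# so it receives more attention weight than the key for "I".
query = [1.0, 0.0]
keys = [[0.0, 1.0],   # "I"
        [1.0, 0.0]]   # "jacket"
values = [[0.0, 1.0],
          [1.0, 0.0]]

output, weights = attend(query, keys, values)
print(weights)  # the second weight ("jacket") dominates
```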
&lt;h1 id="how-to-generate-images">How to generate images&lt;/h1>
&lt;p>Once we&amp;rsquo;ve trained the transformer from the previous section, given a new prompt that we haven&amp;rsquo;t seen, we can:&lt;/p>
&lt;ol>
&lt;li>Encode it using the encoder we trained in the first step&lt;/li>
&lt;li>Generate (hopefully reasonable) image tokens one-by-one using our transformer. The first image
token would be generated using all of the text tokens, the second image token would use all
text tokens and the first image token, and so on. (Note: This is not exactly how it works, but it&amp;rsquo;s
close enough for our purposes.)&lt;/li>
&lt;li>Use the decoder that we trained in the first step to translate those image tokens back into an image&lt;/li>
&lt;/ol>
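&lt;p>The three steps above amount to a simple autoregressive loop. In this sketch, next_token_distribution is a hypothetical stand-in for the trained transformer (here it just returns a uniform distribution), and the vocabulary is invented:&lt;/p>

```python
import random

random.seed(0)
VOCAB = list(range(8))  # invented token ids

def next_token_distribution(tokens):
    # Hypothetical stand-in for the trained transformer: in reality this is
    # a learned function of all of the previous text and image tokens.
    return [1 / len(VOCAB)] * len(VOCAB)

def generate_image_tokens(text_tokens, n_image_tokens):
    tokens = list(text_tokens)
    for _ in range(n_image_tokens):
        # Predict a distribution over the next token given everything so far,
        # sample from it, and append the sample to the sequence.
        probs = next_token_distribution(tokens)
        tokens.append(random.choices(VOCAB, weights=probs)[0])
    # Return only the newly generated image tokens (to hand to the decoder).
    return tokens[len(text_tokens):]

image_tokens = generate_image_tokens([1, 3, 4, 2], n_image_tokens=5)
print(image_tokens)  # five sampled image-token ids
```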
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>I glossed over many of the mathematical and computational details of how this works (I don&amp;rsquo;t even have my head around all of them!),
but the goal was to demystify one approach used to build a(n awesome) generative AI system that has taken the internet by storm.
Hopefully you enjoyed!&lt;/p></description></item><item><title>Probabilistic interpretation of regularization</title><link>https://www.jgindi.me/posts/2021-05-09-regularization/</link><pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-05-09-regularization/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>If you&amp;rsquo;ve read enough of my posts over the years, you know that some of my favorite
topics to write about are those that can be thought about or studied from
different perspectives. In this post, I want to write about regularization, a technique used in machine learning to mitigate a common problem called overfitting &amp;ndash; a problem that crops up when an algorithm fits its understanding of the world so tightly to a particular dataset that it can&amp;rsquo;t make good predictions about data it hasn&amp;rsquo;t seen. Regularization can be thought of as a term added to the optimization objective that directly discourages overfitting, or it can be thought of in an interesting statistical way.&lt;/p>
&lt;h2 id="a-helpful-example-simple-linear-regression">A helpful example: simple linear regression&lt;/h2>
&lt;p>Let&amp;rsquo;s assume that we&amp;rsquo;re building a linear regression model to predict house prices. That is, if $b$ represents the number of bedrooms in a house, we &lt;em>assume&lt;/em> (more on this in a second) that the relationship between the number of bedrooms and the price of the house is linear:
&lt;/p>
$$
p(b) = \theta b + \epsilon.
$$&lt;p>
Here, $p$ is price, $b$ is the number of bedrooms in the house, and $\epsilon$
is a random number that represents the error in our model, or the part of our model that
the data we are using do not explain. What the formula above says is that &lt;em>we believe&lt;/em>
that we can model the relationship between #bedrooms and price with a linear model.
This &lt;em>does not&lt;/em> say that we believe that the actual relationship is linear. This is a
very important distinction. We believe that the linear model might be &lt;em>useful&lt;/em>, not
necessarily that it is &lt;em>correct&lt;/em> or &lt;em>true&lt;/em>.&lt;/p>
&lt;p>(Another way of thinking about our equation, or model, is that it says once we know
the parameter $\theta$ and the particular number of bedrooms $b_0$, the randomness
has been confined to the variation of prices around a known mean: $\theta b_0$.)&lt;/p>
&lt;h2 id="fitting-the-parameters">Fitting the parameters&lt;/h2>
&lt;p>One natural way to find the best parameter $\theta$ for a set of data is to find the value of $\theta$ that literally best fits the input data! To better understand what this means, let&amp;rsquo;s suppose that we have (#bedrooms, price) pairs $(b_i, y_i)$ for $i=1,\dots,100$, and a current guess at a parameter $\theta$. To evaluate $\theta$, a natural measure of how well it fits is the average squared error, where the error for each example is the difference between $\theta b_i$ (our prediction) and $y_i$ (the actual price). Mathematically, we can write this measure down as
&lt;/p>
$$
J(\theta) = \frac{1}{100} \sum_{i=1}^{100} (\theta b_i - y_i)^2.
$$&lt;p>
Now that we&amp;rsquo;ve decided what constitutes a good choice of parameter, we can employ
tools provided by calculus to actually calculate what value of $\theta$ is best
by solving the optimization problem (replacing 100 with the more general $m$) given by
&lt;/p>
$$
\text{argmin}_\theta \sum_{i=1}^m (\theta b_i - y_i)^2.
$$&lt;p>
(This is the value of $\theta$ that minimizes $J$. Without going into detail,
in this case, it turns out that the best value is $\theta = \frac{\mathbf y^T \mathbf b}{\mathbf b^T \mathbf b}$,
where $\mathbf b = (b_1,\dots, b_m)$ and $\mathbf y = (y_1,\dots,y_m)$.)&lt;/p>
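&lt;p>We can sanity-check that closed form numerically with made-up data, comparing it against a brute-force grid search over candidate values of $\theta$:&lt;/p>

```python
# Made-up (#bedrooms, price) data for illustration.
b = [1, 2, 3, 4]
y = [100, 210, 290, 405]

# Closed-form minimizer: theta = (y . b) / (b . b).
theta_closed = sum(yi * bi for yi, bi in zip(y, b)) / sum(bi * bi for bi in b)

def J(theta):
    # Mean squared error of the fit for a given theta.
    return sum((theta * bi - yi) ** 2 for bi, yi in zip(b, y)) / len(b)

# Brute-force search over a fine grid of candidate thetas.
theta_grid = min((t / 1000 for t in range(200_000)), key=J)

print(round(theta_closed, 3))  # about 100.333
print(round(theta_grid, 3))    # the grid search agrees
```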
&lt;h2 id="regularization">Regularization&lt;/h2>
&lt;p>One important concern when fitting a machine learning model is whether or not
your model is too tightly fit to the data that you have. Because models are fit using
a finite sample of data, it is possible, even likely, that your data is not representative
of what can occur &amp;ldquo;in the wild.&amp;rdquo; As such, the model you&amp;rsquo;ve built may be terrific on the data
it used to train, but does not actually generalize to situations it hasn&amp;rsquo;t encountered.
There are various techniques for combating this problem, but the one we will discuss
here is one called regularization.&lt;/p>
&lt;h3 id="the-intuitive-motivation">The intuitive motivation&lt;/h3>
&lt;p>In models with more than one feature, overfitting tends to occur because some of the features have parameters that are too large, i.e., their impact is overstated in the model. As such, rather than just finding the parameters that minimize the least-squares objective, we want to find &lt;em>small&lt;/em> parameters that still fit the data well.
For the simple regression case, we would add a term to the objective:
&lt;/p>
$$
\text{argmin}_\theta ~~ \sum_{y_i, b_i}(\theta b_i - y_i)^2 + \frac{\lambda}{2} \theta^2
$$&lt;p>Intuitively, if $\theta$ is large, the objective value that we are trying to minimize will also be large, so the optimizer will be discouraged from picking that value of $\theta$, even if it fits the data pretty well. Adding this term causes the optimizer to trade off goodness of fit and simplicity (in the sense of parameters that aren&amp;rsquo;t too large). The constant $\lambda$ controls our preferences with respect to that tradeoff: larger values of $\lambda$ encourage smaller values of $\theta$.&lt;/p>
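&lt;p>We can watch the tradeoff in action. Setting the derivative of the regularized objective to zero gives the closed form $\theta(\lambda) = \frac{\sum_i b_i y_i}{\sum_i b_i^2 + \lambda/2}$, so larger $\lambda$ shrinks $\theta$ toward zero (made-up data again):&lt;/p>

```python
# Made-up (#bedrooms, price) data for illustration.
b = [1, 2, 3, 4]
y = [100, 210, 290, 405]

num = sum(bi * yi for bi, yi in zip(b, y))  # sum of b_i * y_i
den = sum(bi * bi for bi in b)              # sum of b_i squared

# Closed form from setting the derivative of the regularized objective
# to zero: theta(lambda) = num / (den + lambda / 2).
thetas = []
for lam in [0, 10, 100, 1000]:
    theta = num / (den + lam / 2)
    thetas.append(theta)
    print(lam, round(theta, 2))  # theta shrinks toward 0 as lambda grows
```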
&lt;h3 id="statistical-interpretation">Statistical interpretation&lt;/h3>
&lt;p>While the intuitive motivation is usually enough, there is a cool
statistical interpretation of what is going on here that I think is worth pointing out.
If we instead think of finding $\theta$ by carrying out &lt;a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum (log) likelihood estimation&lt;/a> (MLE),
then regularization naturally arises when we add the additional assumption to our
model that $\theta$ comes from a normal distribution centered around zero with variance
$1/\lambda$ (we can tune $\lambda$ to change the width of the bell curve). Making this assumption
essentially pins down a probability density function for the parameter $\theta$:
$P(\theta) = \frac{\sqrt{\lambda}}{\sqrt{2\pi}}\exp(-\lambda(\theta - 0)^2/2)$.
Taking $\log$s (this doesn&amp;rsquo;t affect the optimization problem we need to solve), we have
&lt;/p>
$$
\log P(\theta) = \log\biggr( \frac{\sqrt{\lambda}}{\sqrt{2\pi}} \biggr) - \frac{\lambda}{2} \theta^2.
$$&lt;p>
Adding this assumption about the prior distribution over $\theta$ and ignoring constants (with respect to $\theta$), we would need to solve the modified problem (strictly speaking, maximizing the likelihood times a prior is &lt;em>maximum a posteriori&lt;/em> (MAP) estimation rather than plain MLE):
&lt;/p>
$$
\begin{align}
\text{argmax}_\theta~ \log(P(\mathbf y ~|~ \mathbf b, \theta)P(\theta))
&amp;= \text{argmax}_\theta ~ -\sum_{y_i, b_i}(\theta b_i - y_i)^2 - \frac{\lambda}{2} \theta^2\\
&amp;= \text{argmin}_\theta ~ \sum_{y_i, b_i}(\theta b_i - y_i)^2 + \frac{\lambda}{2} \theta^2
\end{align}
$$&lt;p>
which is exactly what we had intuitively motivated in the previous section!&lt;/p>
&lt;p>We&amp;rsquo;ve just uncovered the statistical interpretation of regularization!* Using (this flavor
of) regularization is actually imposing a Gaussian prior onto $\theta$. As we force the
width of $\theta$&amp;rsquo;s bell curve to become smaller by increasing $\lambda$, we are encoding the
fact that larger values of $\theta$ are less likely and should therefore be penalized more
heavily during the optimization process.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post, we encountered a cool technique that underlies many statistical models called maximum likelihood estimation (MLE), and showed that a common technique used to combat overfitting actually has a nice statistical interpretation, too!&lt;/p>
&lt;p>Happy Mother&amp;rsquo;s Day to all!&lt;/p>
&lt;p>*The regularization we discuss here is called L2 regularization. Regularization comes in
other forms too. The most popular other choice is called L1 regularization, and it can
actually be interpreted as imposing a Laplacian (rather than a Gaussian) prior.&lt;/p></description></item><item><title>MSE = Bias² + Variance</title><link>https://www.jgindi.me/posts/2019-09-23-mse/</link><pubDate>Mon, 23 Sep 2019 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2019-09-23-mse/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In statistics, the overarching goal &amp;ndash; in some sense &amp;ndash; is to figure out how to take a limited amount of data
and reliably make inferences about the broader population we don’t have data about. To do this, we study,
develop, and use mathematical objects called statistics or estimators. Formally, these are functions of other
objects called random variables, but, for the moment, it suffices to think of them as ways of using limited
amounts of data furnished by a part to learn about the whole.&lt;/p>
&lt;p>As an example, let’s say that you wanted to find out the average height of all humans on earth, and let’s
further suppose that the actual average human height is 3.5 feet. You might take a random sample of 1000
people, add up all heights and divide by 1000. The height that you get from that procedure, the mean height
of your sample, is an estimator of the actual population height; let’s call it $A$. Alternatively, let’s say
you again took that same sample of 1000 people, disregarded it and decided that your estimator of the mean
population height was going to be zero feet, zero inches; call this (rather silly) estimator $B$.&lt;/p>
&lt;p>Before developing any formal measure of the quality of an estimator, think about the above two estimators.
Does one seem “more reasonable” than the other? For any parameter you want to estimate using some data you’ve
collected, there are infinitely many estimators you could come up with; one natural concern we might seek to
address is how to mathematically distinguish quality estimators from useless estimators.&lt;/p>
&lt;h2 id="mean-squared-error--bias--variance">Mean squared error = bias + variance&lt;/h2>
&lt;p>Mean squared error (henceforth MSE) is an attempt to formally capture the difference in the quality of
different estimators. It is defined as the expected value of the square of the distance between the
estimator’s value and the true value of the parameter you are trying to estimate with it. In the example
above:&lt;/p>
&lt;ul>
&lt;li>The true value of the parameter (average human height) is 3.5 feet.&lt;/li>
&lt;li>The estimator is the sample average of the 1000 people that you measured.&lt;/li>
&lt;/ul>
&lt;p>Because the estimator is a function of random variables (the heights of the people you sampled), it, too, is a random variable, say $X$, with some distribution. We can therefore think about the expected value of some function of $X$ &amp;ndash; in our case, the function is $f(X) = (X - 3.5)^2$. To compute the expected value, you take a weighted average of all possible values $X$ can take, weighting each one by the probability of seeing that outcome. (Don’t worry too much about why we&amp;rsquo;re squaring; it makes the calculus easier.)&lt;/p>
&lt;p>Mathematically, we write
&lt;/p>
$$
\text{MSE}(\hat y) = E_y((\hat y - y)^2)
$$&lt;p>
where $\hat y$ is the estimator and $y$ is the true parameter value.&lt;/p>
&lt;p>Intuitively, if we expect the estimator to, on average, stray far from the true value of the parameter, the
estimator is probably not very good. Alternatively, if that expected value is close to 0, it means that the
estimator deviates very little from the actual parameter, i.e. it’s a great estimator! With some tedious
algebra, we can actually show that&lt;/p>
$$
\begin{align}
\text{MSE}(\hat y) &amp;= E_y((\hat y - y)^2) \\
&amp;= \dots \\ &amp;= E_y((\hat y - E_y(\hat y))^2) + (E_y(\hat y) - y)^2 \\
&amp;= \text{Var}_y(\hat y) + (\text{Bias}_y(\hat y))^2
\end{align}
$$&lt;p>Bias and variance are two very important qualities of estimators that help us understand how they relate to
the real value you’re attempting to estimate:&lt;/p>
&lt;h3 id="bias">Bias&lt;/h3>
&lt;p>Bias tells you the difference between the expected value of your estimator and the actual value of the parameter. To intuitively grasp this, imagine throwing several darts at a board, all of which strike the board close together, but off of the true center. Your throws had high bias; in some sense, your “average” throw’s position was consistent, but generally off the mark.&lt;/p>
&lt;h3 id="variance">Variance&lt;/h3>
&lt;p>Variance tells you how much the estimator tends to move around its expected value. If the estimator (just a function) takes values spread widely around its mean, variance will be high. Suppose you know two basketball
players, both of whom average 15 points per game. One of them scores 30 one game and 0 the next, and the
other scores around 15 consistently. While both players have the same scoring average, one of their scoring
patterns is high variance &amp;ndash; i.e. deviates pretty far from the mean &amp;ndash; and the other is low variance.&lt;/p>
&lt;p>With this decomposition in mind, we can see that if we choose to use MSE as our metric of estimator quality,
it actually decomposes very nicely into two intuitively appealing sources of error. Therefore, to lower MSE,
we need to either reduce bias or reduce variance (or reduce both!). It isn&amp;rsquo;t always so simple, though, as it
is possible that by reducing one, you might have to raise the other, hence the name, the bias-variance
tradeoff. There are a bunch of techniques that are actually quite useful in practice which leverage the
decomposition of MSE into bias and variance.&lt;/p>
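&lt;p>The decomposition is easy to check numerically. Here is a small Monte Carlo simulation of the height example (with made-up numbers: heights drawn from a bell curve around the true mean of 3.5 feet):&lt;/p>

```python
import random

random.seed(0)
TRUE_MEAN = 3.5  # "true" average height, as in the example above

def sample_mean_estimator(n=100):
    # The mean height of a random sample of n simulated people.
    heights = [random.gauss(TRUE_MEAN, 0.5) for _ in range(n)]
    return sum(heights) / n

# Simulate many independent runs of the estimator.
estimates = [sample_mean_estimator() for _ in range(2000)]
mean_est = sum(estimates) / len(estimates)

mse = sum((e - TRUE_MEAN) ** 2 for e in estimates) / len(estimates)
variance = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
bias_squared = (mean_est - TRUE_MEAN) ** 2

print(round(mse, 8))                      # empirical MSE
print(round(variance + bias_squared, 8))  # matches: Var + Bias^2
```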
&lt;h2 id="in-algorithms">In algorithms&lt;/h2>
&lt;p>For example, random forests are estimators that average the opinions of a bunch of high-variance (i.e., overfit-prone) decision trees to, in aggregate, function as a lower-variance estimator. (This technique, a
special case of a more general class of bagging algorithms, uses the fact that the variance of an average
of $n$ independent estimators decreases as $n$ gets bigger.)&lt;/p>
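&lt;p>That variance-shrinking fact is easy to see in a quick simulation (made-up numbers: independent draws from a standard bell curve, whose variance is 1):&lt;/p>

```python
import random

random.seed(0)

def empirical_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# The variance of the average of n independent draws shrinks like 1/n.
variances = {}
for n in [1, 10, 100]:
    averages = [sum(random.gauss(0, 1) for _ in range(n)) / n
                for _ in range(5000)]
    variances[n] = empirical_variance(averages)
    print(n, round(variances[n], 3))  # roughly 1/n
```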
&lt;p>Attacking the MSE problem from the bias side, we have boosting algorithms, wherein the decision trees (or any
other base estimator) are trained in sequence. Each tree in the sequence trains itself more tightly to the
examples that the previous tree predicted incorrectly. In this sense, you are starting with a silly, low
variance, estimator and gradually fitting it more tightly to the data, reducing bias, so that the sequence,
all together, becomes more useful.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>While I&amp;rsquo;ve admittedly left out some of the technical details of bagging and boosting, the point is to
illustrate that something abstract-seeming like MSE can be understood in a very concrete way, and that the
decomposition we discussed can actually lead to practical problem-solving approaches that are actually quite
useful.&lt;/p></description></item></channel></rss>