Introduction
In this post I want to introduce a non-standard way of fitting a mathematical model to data that I came across during a course I took this past semester in computer vision. While gradient descent rules the day (as it should!), the method we discuss here is actually pretty clever, and its simplicity comes with a significant benefit: robustness to outliers.
Linear models
What are they?
Much of machine learning and statistical modeling can be roughly characterized by the following sequence of steps:
- Gather data about some phenomenon or process, often in the form of (input, output) pairs. In this setup, we hope that there is some meaningful relationship between the inputs and the outputs, and the outputs are the phenomenon we want to learn to predict.
- Make some assumptions and try to find a useful mathematical model that explains the input/output relationship.
- Use the learned mathematical model to make predictions on new examples that were not in the original training set of inputs and outputs.
(Mathematically, the model is a function $\hat f$ that maps inputs to outputs, hopefully without too much error. I call the function $\hat f$ here because one way of thinking about what we’re doing is that we’re trying to approximate some true function $f$ that relates the inputs to the outputs.)
There is a lot of nuance and subtlety to how we learn the model from data and how we verify that it’s working well on unseen examples, but that’s the basic idea.
One choice that we the modelers have to make before learning the model is the set of “shapes” it can take, which encodes an assumption about the underlying input/output relationship. The simplest possible model we can use is called a linear model, which assumes that the output changes linearly (usually with some small amount of random variation) as the input changes.
How do we learn them?
To make things more mathematically precise, let’s say our inputs $x_i$ are $d$-dimensional vectors of real numbers ($x_i \in \mathbf{R}^d$), and our corresponding outputs $y_i$ are real numbers ($y_i \in \mathbf{R}$). Using a linear model to model the relationship is equivalent to making the assumption that there is a vector of parameters $\theta \in \mathbf{R}^d$ (the slope) and an intercept $b \in \mathbf{R}$ such that

$$y_i = \theta^\top x_i + b + \varepsilon_i,$$

where $\varepsilon_i$ is some randomness that our model doesn’t capture.
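To make that assumption concrete, here is a minimal sketch in Python/NumPy of generating data that satisfies it (the dimension, dataset size, noise scale, and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 100                       # input dimension and number of examples
theta_true = rng.normal(size=d)     # the "true" slope vector
b_true = 0.5                        # the "true" intercept

X = rng.normal(size=(d, n))         # inputs, stacked so that x_i is the i-th column
eps = 0.1 * rng.normal(size=n)      # the randomness the model doesn't capture
y = theta_true @ X + b_true + eps   # outputs: a linear function of the inputs plus noise
```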
The mathematical characterization of the problem of finding the optimal parameters $\theta^\star$ and $b^\star$ is

$$\theta^\star, b^\star := \underset{\theta, b}{\text{arg min}} \; || \theta^\top X + b\mathbf{1} - y ||_2^2,$$

where $X$ is the matrix whose columns are the inputs $x_i$, $\mathbf{1}$ is a vector of ones, and $y$ is the vector of outputs. Before continuing, let’s break down those symbols:
- “$\theta^\star, b^\star :=$”: We typically use the $\star$ superscript to indicate the best or optimal choice. We use the $:=$ symbol to indicate that we are not writing out any mathematical equivalence, just a definition. We’re saying, “The optimal values of $\theta$ and $b$ are…”
- “$\text{arg min}_{\theta, b}$”: If we had written something like $\min_x f(x)$, the “value” of the expression would be computed by plugging all possible $x$s into the function $f$ and returning the minimum value of $f(x)$. For example, if $f(x) = x^2$ and the possible values of $x$ were $\{-1, 2, 0.1\}$, we would get $\min_x f(x) = f(0.1) = 0.01$. By changing the $\min$ to $\text{arg min}$, we instead return the value of $x$ – rather than the value of $f(x)$ – that minimizes the value of $f(x)$. Thus, “$\underset{\theta, b}{\text{arg min}}$” means that we are looking for the values of $\theta$ and $b$ that minimize the rightmost term…
- “$|| \theta^\top X + b\mathbf{1} - y ||_2^2$”: Without getting into detail here, this takes in our parameters $\theta$ and $b$, our data $X$, and our outputs $y$, and outputs a single number that measures how well our parameters match inputs to outputs. (Lower is better; the short sketch below computes this quantity directly.)
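As a quick illustration of that last bullet, here is a minimal sketch of the quantity being minimized (the helper name `objective` and the toy numbers are my own, purely for illustration):

```python
import numpy as np

def objective(theta, b, X, y):
    """Sum of squared errors: || theta^T X + b*1 - y ||_2^2 (lower is better)."""
    residuals = theta @ X + b * np.ones(X.shape[1]) - y
    return np.sum(residuals ** 2)

# Toy check: with columns of X as examples, a perfect fit drives the objective to zero.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.5]])
theta = np.array([2.0, -1.0])
b = 0.3
y = theta @ X + b                  # outputs generated exactly by the model
print(objective(theta, b, X, y))   # 0.0
```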
As is often the case with these types of formulations, notice that the problem statement alone gives us no information about how to write a computer program that actually obtains the parameters we seek; we have only described the problem setup. In fact, many such problem setups do not admit helpful algorithms even when they can be expressed simply.
Luckily for us, this particular problem can be solved very efficiently, and we most commonly use one of the following two methods:
- Solve some equations using calculus and linear algebra (because in this case we can; see the sketch just after this list).
- Use an algorithm called gradient descent (for cases when we can’t). This algorithm is the workhorse of almost all modern machine learning.
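To give a flavor of the first option, here is a minimal sketch that recovers $\theta$ and $b$ from synthetic data using NumPy’s least-squares solver; the dimensions and numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 2, 200
theta_true, b_true = np.array([1.5, -3.0]), 0.7

X = rng.normal(size=(d, n))                                 # columns are examples
y = theta_true @ X + b_true + 0.05 * rng.normal(size=n)

# Append a row of ones so the intercept b is estimated along with theta.
A = np.vstack([X, np.ones(n)]).T                            # shape (n, d + 1)
params, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_hat, b_hat = params[:-1], params[-1]

print(theta_hat, b_hat)   # should be close to theta_true and b_true
```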
Neither of those is the topic of this post, but there are lots of nice articles out there if you are interested in learning more. For the rest of this post, I want to describe another, less widely known algorithm for finding those parameters.
RANSAC
One deficiency of the approaches mentioned above is that without certain auxiliary techniques, both of those methods are sensitive to outliers, as shown in the simple figure below (source):
Our intuition tells us that something is off here. The fitted line misses the true relationship which, obscured by the outliers, runs roughly along the diagonal from the bottom left to the top right of the plot.
One simple and interesting way of finding $\theta$ that is robust to outliers is called RANdom SAmple Consensus (RANSAC). The way it works is as follows:
- Randomly select a subset of $d$ input/output $(x, y)$ pairs. Stack the $x_i$s into a matrix $\tilde X$ ($x_i$ is the $i$th column) and the $y_i$s into a vector $\tilde y$.
- Solve $\theta^\top \tilde X = \tilde y$ for $\theta$ (provided certain conditions are met, the equation has a unique solution).
- Across the entire original dataset $X$, count inliers, or the number of $(x, y)$ pairs such that $\text{dist}(\theta^\top x, y)$ is small (the modeler chooses a threshold and an application-appropriate distance function $\text{dist}$).
In order to increase our chances of discovering a favorable parameter combination, we can repeat this process until the inlier count (perhaps as a fraction of the overall dataset size) is sufficiently high. The more times we are willing to repeat steps 1-3, the more likely we are to find a good model.
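To make steps 1-3 concrete, here is a minimal sketch of the loop in Python/NumPy. The iteration budget, the inlier threshold, and the use of absolute error as the distance function are illustrative choices left to the modeler (and, as in the steps above, the intercept is omitted):

```python
import numpy as np

def ransac_linear(X, y, n_iters=1000, threshold=0.1, rng=None):
    """Fit theta in y ~ theta^T x by RANSAC; X has shape (d, n) with examples as columns."""
    if rng is None:
        rng = np.random.default_rng()
    d, n = X.shape
    best_theta, best_inliers = None, -1

    for _ in range(n_iters):
        # 1. Randomly select d (input, output) pairs.
        idx = rng.choice(n, size=d, replace=False)
        X_tilde, y_tilde = X[:, idx], y[idx]

        # 2. Solve theta^T X_tilde = y_tilde (skip degenerate selections).
        try:
            theta = np.linalg.solve(X_tilde.T, y_tilde)
        except np.linalg.LinAlgError:
            continue

        # 3. Count inliers: pairs where |theta^T x - y| falls below the threshold.
        inliers = np.sum(np.abs(theta @ X - y) < threshold)
        if inliers > best_inliers:
            best_theta, best_inliers = theta, inliers

    return best_theta, best_inliers
```

This sketch simply keeps the best parameters seen within a fixed budget of iterations; stopping early once the inlier fraction clears some bar is the other common variant.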
The beauty of this algorithm is that the parameter combination we end up with will have been computed from the most “normal” set of examples we encounter; in other words, the algorithm tends to avoid outliers that might adversely influence the model fit. To see how RANSAC ignores outliers, the image below shows the much more sensible line it would find with reasonable settings (source):
While RANSAC is certainly elegant, there are downsides too. As we might expect, RANSAC tends to perform worse as the dataset becomes more and more polluted by outliers, though there are modifications to what we just described that increase the algorithm’s outlier tolerance. The primary disadvantage of RANSAC is that there is no guarantee that the algorithm will converge, which is a fancy way of saying that since our subset selections are random, we can’t be mathematically sure that we will eventually zero in on the optimal model parameters given our data.
Application to (old-school) computer vision
I came across RANSAC in a computer vision course at NYU taught by Robert Fergus. Near the end of the course, after reveling in the various and wondrous ways that neural networks have upended and redefined how computers process and, more recently, create visual information, we had a final unit on techniques in computer vision that predated the deep learning revolution.
One problem from that unit that we might be interested in solving is finding some kind of correspondence between two images. For example, given the two images of the same scene on the left, we might want to produce a mosaic image like the one on the right (source):
To highlight where RANSAC comes into play, let’s suppose that we’ve waved our magic wand and (1) identified “key points” in both of the individual images, and (2) determined a set of correspondences between the key points in the top and bottom images.
RANSAC can help us figure out the transformation that “happened” to the key points in one image to turn them into their (hopefully correct) corresponding points in the other. It turns out that this transformation has parameters in $\mathbf{R}^6$, and we can actually find it using RANSAC as follows:
- Randomly select a subset of matching points from both images. (The number of matching points is determined by the number of parameters to estimate, in this case, 6.)
- Find the parameters of a transformation $T$ that would turn points in one image into their matching points in the other.
- Across all candidate matching pairs of points, denoted $(k_1, k_2)$, count the number of inliers, i.e., pairs for which $T(k_1)$ isn’t far from $k_2$.
After finding the transformation that corresponds to the largest number of inliers, we can use this transformation to carry out the remaining steps required to combine the images.
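As a rough sketch of how those three steps might look in code, assume the matched key points arrive as two aligned arrays `src` and `dst` (row $i$ of each is a candidate matching pair) and that the transformation is affine, so its 6 parameters are pinned down by 3 correspondences, since each pair supplies two equations; the pixel threshold and iteration budget here are illustrative choices:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine transform mapping src points to dst points.
    src, dst: arrays of shape (k, 2) whose rows are (x, y) coordinates."""
    k = src.shape[0]
    A = np.hstack([src, np.ones((k, 1))])             # rows are [x, y, 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)       # solve A @ M = dst for M (3x2)
    return M.T                                        # the usual 2x3 affine matrix

def ransac_affine(src, dst, n_iters=2000, threshold=3.0, rng=None):
    """src[i] and dst[i] are a candidate matching pair of key points."""
    if rng is None:
        rng = np.random.default_rng()
    n = src.shape[0]
    best_T, best_inliers = None, -1

    for _ in range(n_iters):
        # 1. Randomly select 3 candidate matches (enough to fix 6 parameters).
        idx = rng.choice(n, size=3, replace=False)
        # 2. Find the transformation T implied by those matches.
        T = fit_affine(src[idx], dst[idx])
        # 3. Count inliers: matches (k1, k2) for which T(k1) lands near k2.
        mapped = np.hstack([src, np.ones((n, 1))]) @ T.T    # apply T to every key point
        dist = np.linalg.norm(mapped - dst, axis=1)
        inliers = np.sum(dist < threshold)                  # threshold in pixels

        if inliers > best_inliers:
            best_T, best_inliers = T, inliers

    return best_T, best_inliers
```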
Conclusion
In this post, we learned about RANSAC, an algorithm that finds the parameters of a model even in the presence of otherwise corruptive outliers. Algorithms that operate by trying a bunch of (literally) random options are generally the stuff of novices, so it’s cool when that very simplicity turns out to solve an important technical problem.