<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Jack Gindi</title><link>https://www.jgindi.me/</link><description>Recent content on Jack Gindi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 26 Nov 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jgindi.me/index.xml" rel="self" type="application/rss+xml"/><item><title>Why the sigmoid?</title><link>https://www.jgindi.me/posts/2025-11-26-logistic-regression/</link><pubDate>Wed, 26 Nov 2025 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2025-11-26-logistic-regression/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Logistic regression is often the first classification algorithm a machine learning engineer encounters. It is the workhorse of binary classification; it&amp;rsquo;s simple, interpretable, and surprisingly effective. If you have taken a standard introductory course, you have likely been presented with the model definition as a fait accompli: take a linear predictor, wrap it in a sigmoid function, and voila, you have a probability.&lt;/p>
&lt;p>But why? Why the sigmoid function specifically? Why not use a clipped straight line, or a different S-shaped curve like the cumulative normal distribution? Is it just an arbitrary choice that happens to work well, or is there something fundamental about the mathematics that demands this specific shape?&lt;/p>
&lt;h1 id="logistic-regression">Logistic regression&lt;/h1>
&lt;p>For various reasons we will not cover here, directly using linear regression for classification does not work well. If $x$ is $d+1$-dimensional, and $\beta = (\beta_0, \dots, \beta_{d})$ are model parameters that we adjust during training, then we can take our linear model $\beta^\top x$ and force it to only take values between 0 and 1 by pushing it through the sigmoid function. In other words, we model the probability that $y$ is 1 given features $x$ as
&lt;/p>
$$
p(y=1|x) \approx \frac{1}{1 + \exp(-\beta^\top x)} = \sigma(\beta^\top x).
$$&lt;p>
The sigmoid function $\sigma$ has an output range between 0 and 1, but there have to be other such functions, right? Why use this one and not some other one? In the remainder of this post, we look at two ways to understand why the choice of sigmoid is deeper than it seems.&lt;/p>
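&lt;p>Before we do, here is the sigmoid in plain Python as a quick reference (the branching is a standard numerical-stability trick, not something the math requires):&lt;/p>

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), computed without overflow for any z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for very negative z, exp(-z) would overflow
    return ez / (1.0 + ez)

# No matter how extreme the input, the output stays in [0, 1]
# (it only touches the endpoints via floating-point underflow).
for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    assert 0.0 <= sigmoid(z) <= 1.0
```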
&lt;h1 id="range-mapping">Range mapping&lt;/h1>
&lt;p>(For this section, we will refer to $p(y=1|x)$ as $p$.)&lt;/p>
&lt;p>Assuming we want to leverage linear regression as much as possible, we want to find something that we can model using $\beta^\top x$. The most obvious thing we could try first is to model the probability directly:
&lt;/p>
$$
p \approx \beta^\top x.
$$&lt;p>One simple reason this doesn&amp;rsquo;t work is just that the output ranges don&amp;rsquo;t match. To put a finer point on it, since $p$ is a probability, it can only take values between 0 and 1, whereas $\beta^\top x$ can take any real value. The plot below helps to visually understand our problem:&lt;/p>
&lt;img src="https://www.jgindi.me/posts/logistic-regression/line_bad.png">
&lt;p>Because of this, we need to find something else we can model with $\beta^\top x$.&lt;/p>
&lt;p>Instead of modeling $p$, what if we modeled $p/(1-p)$, a quantity known as the odds? Things get a little bit better here, since the odds range from 0 (when $p$ is very small) to $+\infty$ (when $p\approx 1$). However, remember that $\beta^\top x$ can take &lt;em>any&lt;/em> real value, including negative values all the way to $-\infty$! So while this is a little bit better, it still doesn&amp;rsquo;t quite work.&lt;/p>
&lt;p>One thing to notice about the odds is that they cover the domain of the $\log$ function. So what if we modeled the log odds $\log(p / (1-p))$ instead of the odds? In this case, the ranges match! When the odds are close to 0, the log odds approach $-\infty$. As the odds approach $+\infty$, so do the log odds! At this point, our model is
&lt;/p>
$$
\log\biggl( \frac{p}{1-p} \biggr) \approx \beta^\top x.
$$&lt;p>We can solve this for $p$ using some simple algebra to see that this is equivalent to a model for $p$. To do this, first we exponentiate both sides
&lt;/p>
$$
\frac{p}{1-p} \approx \exp(\beta^\top x)
$$&lt;p>
Next, we multiply both sides by $1-p$ and distribute on the right.
&lt;/p>
$$
p \approx \exp(\beta^\top x) - p\exp(\beta^\top x)
$$&lt;p>
Next, we move the $p$ terms to the left side and pull them out to get
&lt;/p>
$$
p (1 + \exp(\beta^\top x)) \approx \exp(\beta^\top x)
$$&lt;p>
Finally, dividing by $1 + \exp(\beta^\top x)$, we have
&lt;/p>
$$
p \approx \frac{\exp(\beta^\top x)}{1 + \exp(\beta^\top x)}.
$$&lt;p>
We can see that this is actually the sigmoid by dividing the numerator and denominator by $\exp(\beta^\top x)$, which gives us the final form of our model for $p$:
&lt;/p>
$$
p \approx \frac{\exp(\beta^\top x) / \exp(\beta^\top x)}{1 / \exp(\beta^\top x) + \exp(\beta^\top x) / \exp(\beta^\top x)} = \frac{1}{\exp(-\beta^\top x) + 1} = \sigma(\beta^\top x).
$$&lt;p>
(Remember that $1/\exp(\beta^\top x) = \exp(-\beta^\top x)$.)&lt;/p>
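&lt;p>As a sanity check on the algebra, a few lines of Python confirm that the sigmoid and the log-odds map really do undo each other:&lt;/p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    return math.log(p / (1.0 - p))

# The derivation above says these two maps are inverses of each other.
for z in (-3.0, -0.5, 0.0, 1.7, 4.2):
    assert abs(log_odds(sigmoid(z)) - z) < 1e-9
for p in (0.01, 0.25, 0.5, 0.9):
    assert abs(sigmoid(log_odds(p)) - p) < 1e-9
```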
&lt;p>This derivation is intuitive and visually satisfying. It solves the &amp;ldquo;unbounded output&amp;rdquo; problem by mapping the infinite range of a linear model to the unit interval of a probability. However, a skeptical reader might still ask: &amp;ldquo;Why did we choose to model the log-odds specifically?&amp;rdquo;&lt;/p>
&lt;p>While the log-odds are a convenient choice for range mapping, they are not the only choice. We could have chosen other functions to map $(-\infty, \infty)$ to $[0,1]$ (such as the cumulative distribution of a Gaussian). To understand why the sigmoid is not just a convenient choice, but the mathematically &amp;ldquo;natural&amp;rdquo; choice, we need to dig a bit deeper.&lt;/p>
&lt;h1 id="the-exponential-family">The exponential family&lt;/h1>
&lt;p>Another, perhaps more principled way to arrive at the sigmoid function is to make an assumption about the conditional distribution of the target $Y$ given the input $X$. If we assume that
&lt;/p>
$$
Y|x \sim \text{Bern}(p(\beta^\top x)),
$$&lt;p>
i.e., that given a value for $x$ and parameters $\beta$, $Y$ is a Bernoulli random variable with success probability $p(\beta^\top x)$, then we can write the probability mass function for this distribution as
&lt;/p>
$$
P(Y=y|X=x; p(\beta^\top x)) = \begin{cases}
p(\beta^\top x) &amp;\text{ if } y=1 \\
1 - p(\beta^\top x) &amp;\text{ if } y=0
\end{cases}
$$&lt;p>
From this point forward, we will make our notation less cumbersome and refer to $p(\beta^\top x)$ simply as $p$.&lt;/p>
&lt;p>There&amp;rsquo;s a more compact, cleverer way of writing $P(Y=y|X=x; p)$ that packs both cases from the previous formulation into a single expression:
&lt;/p>
$$
P(Y=y|X=x; p) = p^y (1 - p)^{(1-y)}.
$$&lt;p>
The key is that $y$, which is binary, acts as a switch that turns on the relevant term for each case. When $y = 1$, $1-y=0$, so $P(Y=1|X=x; p)$ resolves to $p$. Similarly, when $y = 0$, the expression resolves to $1 - p$.&lt;/p>
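&lt;p>A tiny Python version makes the switch behavior easy to verify:&lt;/p>

```python
def bernoulli_pmf(y, p):
    # y is 0 or 1; each exponent turns the irrelevant factor into 1.
    return p ** y * (1.0 - p) ** (1 - y)

p = 0.3
assert bernoulli_pmf(1, p) == p        # y = 1: p^1 * (1-p)^0 = p
assert bernoulli_pmf(0, p) == 1.0 - p  # y = 0: p^0 * (1-p)^1 = 1 - p
```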
&lt;p>Let&amp;rsquo;s first rewrite
&lt;/p>
$$
\begin{align*}
P(Y=y|X=x; p) &amp;= p^y (1 - p)^{(1-y)} \\
&amp;= \exp(\log(p^y (1 - p)^{(1-y)})) \\
&amp;= \exp(y\log(p) + (1 - y)\log(1 - p)) \\
&amp;= \exp(y\log(p) + \log(1 - p) -y\log(1 - p)) \\
&amp;= \exp((\log(p) - \log(1 - p))y + \log(1 - p)) \\
&amp;= \exp(y\log(p/(1 - p)) + \log(1 - p))
\end{align*}
$$&lt;p>This is great! But why go through this derivation?&lt;/p>
&lt;p>It reveals that the log-odds are what is known as the &lt;strong>natural parameter&lt;/strong> of the Bernoulli distribution. By setting our linear predictor $\beta^\top x$ equal to this natural parameter, we are using what is called the &lt;strong>canonical link function&lt;/strong>. This specific choice is mathematically &amp;ldquo;safe&amp;rdquo;: it guarantees that the log-likelihood function with respect to the parameters $\beta$ is concave. In practical terms, this makes the likelihood much easier to maximize, since there are no local optima to get stuck at. And to see the model we get for $p$ when we model the log odds as $\beta^\top x$, we can simply revisit the earlier section on range mapping.&lt;/p>
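&lt;p>As an illustration of that practical payoff (not part of the derivation above), here is a small NumPy sketch that fits $\beta$ by gradient ascent on the Bernoulli log-likelihood. The data is synthetic and the learning rate and iteration count are arbitrary choices of mine:&lt;/p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 500 points, 2 features plus an intercept.
n, d = 500, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([0.5, 2.0, -1.0])  # arbitrary "ground truth"
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

# Gradient ascent on the log-likelihood. For the canonical link the gradient
# takes the simple form X^T (y - sigma(X beta)), and concavity means there
# are no bad local optima to get stuck in.
beta = np.zeros(d + 1)
learning_rate = 0.1
for _ in range(2000):
    beta += learning_rate * X.T @ (y - sigmoid(X @ beta)) / n
```

&lt;p>Because the objective is concave, every run of this loop heads toward the same maximizer; the recovered coefficients land near the signs and rough magnitudes of the ones that generated the data.&lt;/p>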
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we&amp;rsquo;ve seen two ways of motivating the use of the sigmoid function in logistic regression. The first is easier to grasp and more intuitive, but the second gives a glimpse of the mathematical depth that is often skipped over when engineers study ML for the first time. I hope you enjoyed, and happy Thanksgiving!&lt;/p></description></item><item><title>Building ML Paper Explorer (late 2024)</title><link>https://www.jgindi.me/posts/2024-12-09-paper-rec-experience/</link><pubDate>Fri, 13 Dec 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-12-09-paper-rec-experience/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Today, I want to talk about a recent personal project. In doing the project, I&amp;rsquo;d say I was probably 60% motivated by learning about new
techniques and tools that are out there, and 40% by seeing what it would be like to put together an end-to-end application with the help
of AI assistants, both within my coding environment (Github Copilot) and without (Anthropic&amp;rsquo;s Claude with a smattering of OpenAI&amp;rsquo;s ChatGPT). The
goal was not to build the most optimized, low-latency, state-of-the-art tool; it was to see if I could get all the pieces working together to do
something interesting.&lt;/p>
&lt;p>I&amp;rsquo;ve called what I&amp;rsquo;ve built ML Paper Explorer. It is a simple interface that allows the user to search and save academic papers on machine
learning. The project required me to complete tasks in four broad categories: frontend, backend, machine learning (ML), and deployment. Below, I&amp;rsquo;ll
talk about some of the features I built, what the experience of leaning heavily on AI was like, and close with some broader reflections on being a software engineer in this new age.&lt;/p>
&lt;p>If you want to check the project out, go to &lt;a href="https://www.ml-paper-explorer.com">ml-paper-explorer.com&lt;/a>.&lt;/p>
&lt;h1 id="features">Features&lt;/h1>
&lt;p>I started with a few features I thought were interesting:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Paper relevance engine&lt;/strong>: The key backend component was a machine learning engine that could search for papers relevant to a user&amp;rsquo;s query. I implemented a two-pass ranking system. The first pass uses a fast, keyword-based (non-ML) algorithm called &lt;a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25&lt;/a> to filter tens or hundreds of thousands of papers down to ~1000. To further filter the smaller collection down to what I show the user (on the order of 10), I use a text embedding model to generate a numerical, semantically rich representation of the user query and find the ~10 papers whose representations (which we&amp;rsquo;ve pre-stored in a vector database) are the &amp;ldquo;closest&amp;rdquo; to it.&lt;/li>
&lt;li>&lt;strong>Personalization&lt;/strong>: Users can log in (no password required) to like papers and get recommendations based on what they&amp;rsquo;ve liked.&lt;/li>
&lt;li>&lt;strong>Explanations&lt;/strong>: When a paper is returned in response to a query, the user can see an explanation of why the paper might have been returned in response to the query. This is simply implemented using some prompt engineering on top of the query and title/abstract of the paper in question.&lt;/li>
&lt;/ul>
&lt;p>(As a general note, when I talk about papers here, I&amp;rsquo;m just using titles and abstracts so as to keep storage and processing times reasonable.)&lt;/p>
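&lt;p>To make the two-pass idea concrete, here is a heavily simplified Python sketch. It is &lt;em>not&lt;/em> the project&amp;rsquo;s actual code: the four-document corpus is made up, the first pass is the textbook Okapi BM25 formula, and the embed function is a bag-of-words stand-in for a real text embedding model.&lt;/p>

```python
import math
from collections import Counter

# Toy corpus of (title + abstract) texts; a real system indexes thousands.
docs = [
    "attention is all you need transformer sequence model",
    "deep residual learning for image recognition convolutional networks",
    "bm25 ranking function for text retrieval",
    "batch normalization accelerating deep network training",
]

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with classic Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    N = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / N
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for q in tokenize(query):
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            s += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def embed(text):
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(tokenize(text))

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, docs, first_pass_k=3, final_k=1):
    # Pass 1: cheap keyword filter (BM25). Pass 2: semantic rerank.
    scores = bm25_scores(query, docs)
    shortlist = sorted(range(len(docs)), key=lambda i: -scores[i])[:first_pass_k]
    qv = embed(query)
    reranked = sorted(shortlist, key=lambda i: -cosine(qv, embed(docs[i])))
    return reranked[:final_k]
```

&lt;p>In the real system, the shortlist would be ~1000 papers and the second pass would query a vector database of pre-computed embeddings rather than embedding documents on the fly.&lt;/p>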
&lt;h1 id="implementing-the-web-app">Implementing the web app&lt;/h1>
&lt;p>Given my background, this part was quite unfamiliar to me, and to do it, I leaned &lt;em>heavily&lt;/em> on Anthropic&amp;rsquo;s Claude.&lt;/p>
&lt;p>Claude essentially wrote the first iteration of the frontend completely from scratch. Then, when I decided I wanted to separate the initial one-page design into multiple pages, it deftly refactored the code to cleanly handle the updated organization. It even added a very aesthetic landing page without my explicitly asking for it! There were other small, probably underspecified changes I wanted to make, such as displaying a user&amp;rsquo;s login status more cleanly, adding subtle animation when search results appear, changing the formats of the cards containing paper details, or adding a navigation bar, and it was able to correctly and efficiently make those changes as well. Implementing the backend felt a little bit more comfortable and familiar, but Claude was very helpful setting up some boilerplate code that I found straightforward to extend. As I&amp;rsquo;m sure other programmers have experienced, Claude was also very useful as a debugging partner.&lt;/p>
&lt;p>Deployment with modern cloud-based tools is also not an area in which I&amp;rsquo;m terribly knowledgeable or adept. To get the app fully operational online, I had to deploy:&lt;/p>
&lt;ul>
&lt;li>The backend&lt;/li>
&lt;li>The frontend&lt;/li>
&lt;li>A database to hold information about paper metadata and users&amp;rsquo; liked papers&lt;/li>
&lt;li>A vector database for paper similarity search&lt;/li>
&lt;/ul>
&lt;p>Enter Claude once again! It identified available services I could use to host the various pieces (though I ended up hosting the frontend with AWS Amplify rather than Vercel) and then helped me stumble my way through getting all of the pieces to connect and talk to one another. It helped me navigate some of Amazon&amp;rsquo;s web interfaces by looking at screenshots, and also interpreted and explained various error messages that would have taken me longer to resolve on my own.&lt;/p>
&lt;p>I should stress here that without Claude&amp;rsquo;s help, even as a full-time software engineer, this same project that took me a few short weeks to get off the ground would have likely taken me several months, if not much longer.&lt;/p>
&lt;h1 id="reflections">Reflections&lt;/h1>
&lt;p>While it was exciting to get something up and running so quickly, doing this project sparked a few thoughts about what the emergence of tools this powerful might mean for me as a software engineer going forward.&lt;/p>
&lt;p>First, it seems less important than ever for information to live in my head. Not so long ago, to build this simple application I&amp;rsquo;ve described, I would have needed reasonable command of web development, dev ops (for deployment), and machine learning fundamentals. I was able to get by with little-to-no up-to-date knowledge about &lt;em>two out of three&lt;/em> of those. Would I have been able to do this same project without any programming knowledge whatsoever? I&amp;rsquo;m not sure we&amp;rsquo;re there yet. But I &lt;em>was&lt;/em> surprised how little I needed to get started.&lt;/p>
&lt;p>With generative AI looking like it will intermediate more and more parts of our work lives, it seems much more important to be able to articulate what you&amp;rsquo;re looking for than to have a lot of pre-loaded a priori knowledge. Another way of saying this is that our ability to accomplish nontrivial things seems to be decorrelating from the amount of time we&amp;rsquo;ve spent learning about them. Given how much time I&amp;rsquo;ve spent studying computer science, math, and machine learning over the last decade, I find this unsatisfying! While I still believe that at this stage, deep understanding and investigation still helps me produce my best work, will that be the case if these models keep on improving at this pace?&lt;/p>
&lt;p>The second thing that occurred to me is the way that a certain attribute of generative AI tools that disappoints some people makes it excellent as a coding partner. When people prompt AIs for things like essays and poems, their complaint is often that it&amp;rsquo;s too&amp;hellip; well&amp;hellip; average. &amp;ldquo;That essay would get a B+,&amp;rdquo; they say, &amp;ldquo;but I certainly wouldn&amp;rsquo;t give it an A.&amp;rdquo; I think some disappointment about the quality of AI writing can be boiled down to the fact that it feels sterile and derivative. As a programmer, though, this &amp;ldquo;average&amp;rdquo; quality is actually exactly what you want! When asking an AI for help with programming, what you&amp;rsquo;re looking for is often the consensus opinion about the best way to solve this or that problem. The sort of averaging or convergence that occurs when you compress the entire internet into the parameters of a language model ends up making models like Claude and GPT very helpful programming partners, and frustratingly boring writers. (I think it&amp;rsquo;s certainly possible to get AI to write things that are interesting, but it usually takes effort and clever prompting.)&lt;/p>
&lt;p>While I am admittedly somewhat uncomfortable about the ways that software engineering is going to change &amp;ndash; on a shorter timeline than I thought it might &amp;ndash; I do believe that humanity will ultimately figure out how to leverage these AI technologies to create a better world. In the near-to-medium term, we will have to be extraordinarily careful about ethically and safely applying them (or not) to sensitive areas like education, the military, biomolecular design, or our financial system, but if we can navigate those challenges successfully, I sincerely believe there is tremendous potential.&lt;/p>
&lt;p>It&amp;rsquo;s very possible, even likely, that in that new and hopefully improved world, you&amp;rsquo;ll find me with an old computer, disconnected from the internet, coding, unassisted, like we did in the before times.&lt;/p></description></item><item><title>A bound on sorting performance</title><link>https://www.jgindi.me/posts/2024-11-17-sorting-bound/</link><pubDate>Sun, 17 Nov 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-11-17-sorting-bound/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>If you wake any programmer up in the middle of the night and ask them to name an algorithm, a sizable fraction would probably invoke some kind of sorting procedure. Some might name &lt;a href="https://en.wikipedia.org/wiki/Quicksort#:~:text=Quicksort%20is%20a%20divide%2Dand,or%20greater%20than%20the%20pivot.">quicksort&lt;/a>, some &lt;a href="https://en.wikipedia.org/wiki/Merge_sort">merge sort&lt;/a>, still others &lt;a href="https://en.wikipedia.org/wiki/Insertion_sort">insertion sort&lt;/a>, and some might troll you by naming &lt;a href="https://en.wikipedia.org/wiki/Bogosort">Bogosort&lt;/a>.&lt;/p>
&lt;p>The first three of those algorithms are all what are known as comparison-based sorts: all of them work by comparing elements and making decisions based on the results of the comparisons. In this post, I want to talk about a lower bound on efficiency for comparison-based sorting algorithms. In other words, I want to show that if you invented a new comparison-based sorting algorithm, then even without knowing how it works, I could tell you what its best conceivable runtime is (as a function of the input size).&lt;/p>
&lt;p>To get a better sense for what I mean, let&amp;rsquo;s dive in.&lt;/p>
&lt;h1 id="comparison-based-sorting">Comparison-based sorting&lt;/h1>
&lt;p>To understand what we mean by comparison-based sorting, let&amp;rsquo;s walk through one of the algorithms I mentioned earlier: merge sort.&lt;/p>
&lt;p>Merge sort essentially works by sorting the first half of the input, sorting the second half, and then merging the two sorted results. But how do we sort the first and second halves? We sort the first half of the first half, sort the second half of the first half, then merge them. And so on and so forth. In order for this recursive process to work, though, the process has to bottom out, right? Right! It bottoms out when a &amp;ldquo;half&amp;rdquo; is empty or has one element, since empty and singleton lists are (trivially) sorted.&lt;/p>
&lt;p>To show this with an example, let&amp;rsquo;s say we start with the input list [1, 3, 8, 4, 5, 2, 6, 9, 7]. We would&lt;/p>
&lt;ol>
&lt;li>Split the list into two halves: [1, 3, 8, 4] and [5, 2, 6, 9, 7].&lt;/li>
&lt;li>Split the first half into two halves: [1, 3] and [8, 4].&lt;/li>
&lt;li>Split the first half into two halves: [1] and [3].&lt;/li>
&lt;li>Each of [1] and [3] is sorted, so we merge them into [1, 3].&lt;/li>
&lt;li>Split [8, 4] into two halves: [8] and [4].&lt;/li>
&lt;li>Each of [8] and [4] is sorted, so we merge them into [4, 8].&lt;/li>
&lt;li>Now we merge [1, 3] and [4, 8] into [1, 3, 4, 8].&lt;/li>
&lt;li>Carry out the same recursive process for [5, 2, 6, 9, 7] to get [2, 5, 6, 7, 9].&lt;/li>
&lt;li>Merge [1, 3, 4, 8] with [2, 5, 6, 7, 9] to get the final result: [1, 2, 3, 4, 5, 6, 7, 8, 9].&lt;/li>
&lt;/ol>
&lt;p>The comparisons in merge sort happen in the merging stage, which we won&amp;rsquo;t go into detail about here. Now that we&amp;rsquo;ve seen one example of a comparison-based sort, we turn to thinking about sorting more generally using decision trees.&lt;/p>
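&lt;p>For the curious, here is one possible Python implementation of the procedure described above; the merge step is where every comparison happens:&lt;/p>

```python
def merge_sort(xs):
    """Split in half, sort each half recursively, then merge the results."""
    if len(xs) <= 1:  # base case: empty and singleton lists are sorted
        return xs
    mid = len(xs) // 2
    return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

def merge(left, right):
    """Merge two sorted lists; every comparison in the algorithm lives here."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # the comparison step
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])  # one side is exhausted; append the remainder
    out.extend(right[j:])
    return out

# The worked example from above:
assert merge_sort([1, 3, 8, 4, 5, 2, 6, 9, 7]) == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```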
&lt;h1 id="performance-bound">Performance bound&lt;/h1>
&lt;p>So how can we possibly say anything important about the efficiency of a whole class of algorithms without considering every possible implementation?&lt;/p>
&lt;p>First, let&amp;rsquo;s suppose we have some input list of size $n$. The &lt;em>indices&lt;/em> of this list &amp;ndash; i.e., the numbers $1, \dots, n$ &amp;ndash; have $n! = n \cdot (n-1) \cdot (n-2) \cdot \dots \cdot 3 \cdot 2 \cdot 1$ possible orderings, exactly one of which puts the &lt;em>elements&lt;/em> in sorted order. We want to say something about the minimum number of comparisons required to find this ordering.&lt;/p>
&lt;p>One way to think about this sorting problem is to use the abstraction of a decision tree. To make this more specific, the leaf nodes (the nodes at the bottom of the tree) each represent one possible ordering of the list. The other nodes (called internal nodes) represent comparisons between elements at different indices of the list. An example of this tree is shown in the image below (&lt;a href="https://genome.sph.umich.edu/w/images/b/b8/Biostat615-lecture6-presentation.pdf">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/sorting-bound/tree.png" width="550" height="350"/>
&lt;p>Each of the ovals represents a comparison between the elements at two &lt;em>indices&lt;/em> of the array. To understand how to read this tree, let&amp;rsquo;s say that our input array is called $A$. At the root node of the tree, if $A[1] \leq A[2]$, then we would proceed to take the left branch and compare $A[2]$ with $A[3]$. If $A[2] \leq A[3]$, then we would take the left branch again and reach the leftmost leaf, which would indicate that $A$ was already in sorted order. With other input orderings, though, the index order that results in $A$ being sorted might be some other leaf, which we could similarly determine by doing a bunch of comparisons. (The key to avoiding confusion here is to remember that the numbers in the ovals are &lt;em>indices&lt;/em>, not the actual elements of the input list.)&lt;/p>
&lt;p>Now, if there are $n!$ possible orderings of $A$, then there must be $n!$ leaves in the tree. Furthermore, we know that the length of the longest root-to-leaf path is the largest possible (worst-case) number of comparisons we would need to do to get our sorted order. Thus, in order to understand the best possible worst-case performance of our comparison-based sort, we want to find &lt;strong>the length of the longest possible root-to-leaf path in this decision tree.&lt;/strong>&lt;/p>
&lt;p>Let&amp;rsquo;s suppose that the algorithm always completes after at most $h$ comparisons &amp;ndash; that is, the decision tree has height $h$. Another way of stating what we want to find is to say we&amp;rsquo;re looking for a lower bound on $h$. With $h$ comparisons, we can distinguish between at most $2^h$ orderings, since each comparison has two possible outcomes. In order to make sure we find the sorted order, the tree needs a leaf for each of the $n!$ possible orderings, so all $n!$ of them must be covered by at most $2^h$ leaves. In other words, we need this inequality to hold:
&lt;/p>
$$
2^h \geq n!.
$$&lt;p>Taking the log (base 2, as is customary in computer science) of both sides, this can be rewritten as
&lt;/p>
$$
h \geq \log(n!).
$$&lt;p>That&amp;rsquo;s great!&amp;hellip; But what is $\log(n!)$? On the one hand, $n!$ is huge, but on the other, maybe the $\log$ tames it? Well, we know that $n! \geq n(n-1)\dots (n/2) \geq (n/2)^{n/2}$, so we can rewrite our inequality again as
&lt;/p>
$$
h \geq \log(n!) \geq \log((n/2)^{n/2}) = \frac{n}{2}\log \biggl( \frac{n}{2} \biggr).
$$&lt;p>
(The equality holds because of a property of logarithms: $\log(a^b) = b\log(a)$.) Ignoring constants, we get that $h$ is bounded below by $n \log n$. To put it in a way that underscores how cool this proof is, what we&amp;rsquo;re saying here is that no comparison sort can work using a worst-case number of comparisons that is (ignoring constants) smaller than $n \log n$. Again, we did this without looking at any particular implementations!&lt;/p>
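&lt;p>If you&amp;rsquo;d like to convince yourself of the inequality numerically, a few lines of Python will do it. (Python&amp;rsquo;s &lt;em>math.factorial&lt;/em> is exact integer arithmetic, so there&amp;rsquo;s no floating-point trickery hiding in $n!$ itself.)&lt;/p>

```python
import math

# Check that log2(n!) >= (n/2) * log2(n/2) for a few sizes.
for n in (8, 64, 1024):
    exact = math.log2(math.factorial(n))
    estimate = (n / 2) * math.log2(n / 2)
    assert exact >= estimate
```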
&lt;h1 id="coda-can-we-do-better">Coda: can we do better?&lt;/h1>
&lt;p>There&amp;rsquo;s one question left to answer: What if we relax the requirement that our algorithm be based on comparisons? Can we achieve a better worst-case performance than $n \log n$?&lt;/p>
&lt;p>The answer is yes, and if our inputs follow a couple of additional (important) assumptions, we can do it with a pretty simple algorithm at that. If the elements of the input list are nonnegative integers that take values up to some maximum $M$, we can use the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Create a list $C$ of zeros of size $M+1$, one slot for each possible value $0, \dots, M$. The counting array $C$ effectively acts as a frequency table for the input, where $C[j]$ holds the count of occurrences of the integer $j$.&lt;/li>
&lt;li>For each element of the array, increment the count at that element&amp;rsquo;s index. In other words, if you see a 5, increment $C[5]$.&lt;/li>
&lt;li>Once you&amp;rsquo;ve iterated through the entire list, iterate over $C$ and add $C[j]$ copies of $j$ to the output list.&lt;/li>
&lt;/ol>
&lt;p>This algorithm, called counting sort, is probably the most famous non-comparison sort. If the initial array has $n$ elements and $C$ has size $M$, then the algorithm takes $n + M$ steps to complete (ignoring constants). This can be good or it can be bad. It can be awesome if $M$ is not too much larger than $n$, since then (ignoring constants again), it would take approximately $n$ steps, which is faster than our comparison-sort lower bound of $n \log n$. If $M$ is very large, however, say $M \approx n^2$, then we&amp;rsquo;ve lost our performance edge. This algorithm also doesn&amp;rsquo;t work with non-integers, which makes it less generally applicable than we&amp;rsquo;d like.&lt;/p>
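&lt;p>The three steps above fit in a few lines of Python (the names are mine, purely illustrative):&lt;/p>

```python
def counting_sort(xs, M):
    """Sort nonnegative integers no larger than M, with no comparisons."""
    counts = [0] * (M + 1)   # counts[j] will hold how many times j appears
    for x in xs:
        counts[x] += 1       # e.g., seeing a 5 increments counts[5]
    out = []
    for j, c in enumerate(counts):
        out.extend([j] * c)  # emit c copies of j, in increasing order
    return out

assert counting_sort([5, 1, 3, 1, 0, 5], M=5) == [0, 1, 1, 3, 5, 5]
```

&lt;p>Both loops are visible in the code: one over the $n$ input elements and one over the $M+1$ counters, which is where the $n + M$ running time comes from.&lt;/p>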
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we started with a hands-on example, established theoretical bounds using a decision tree formulation of the sorting problem, and finally explored how changing our assumptions about the inputs can unlock faster algorithms.&lt;/p>
&lt;p>Sorting algorithms are a cornerstone of computer science, and understanding their limits helps us appreciate their clever design and implementation. The balance between theory and practice highlights the necessity of mathematics to the design of efficient algorithms that power our world.&lt;/p></description></item><item><title>Existence vs construction</title><link>https://www.jgindi.me/posts/2024-09-02-exist-vs-construct/</link><pubDate>Wed, 17 Jul 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-09-02-exist-vs-construct/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In everyday life, we show that things exist by producing an example. In order to believe that black sheep exist, I would need someone to show me one. To believe the claim that pigs can fly, I would need to actually observe a flying pig. If you can show me the thing, then I will believe it exists.&lt;/p>
&lt;p>In mathematics, existence is a more abstract concept. Under certain assumptions about logic, a mathematical object can be proven to exist without ever being explicitly constructed or observed. For instance, we can show that a function $f$ has a root $x_0$ (this means that $f(x_0) = 0$) without knowing the concrete value for $x_0$. This can feel counterintuitive, but in the mathematical world, these two concepts are actually separate, with the task of construction often significantly more difficult than the task of existence.&lt;/p>
&lt;p>In this post, we will take a high-level tour through a few examples of problems for which existence is much simpler than construction. Departing from our everyday intuition that existence and construction go hand in hand requires creativity and willingness to think outside the box. Without further ado, let&amp;rsquo;s jump in!&lt;/p>
&lt;h1 id="testing-for-convergence-vs-finding-the-sum">Testing for Convergence vs. Finding the Sum&lt;/h1>
&lt;p>One area in which we can readily observe the difference in difficulty between showing that something exists and constructing it concerns infinite sums. Typically, when we&amp;rsquo;re dealing with infinite sums, there are two questions we care about:&lt;/p>
&lt;ol>
&lt;li>Is the sum finite?&lt;/li>
&lt;li>What is the sum? (This question only really makes sense if the answer to (1) is yes.)&lt;/li>
&lt;/ol>
&lt;p>The Basel Problem, posed in 1650 by Pietro Mengoli and solved by Euler in 1734, asks for the sum of the infinite series $S = \sum_{n = 1}^\infty \frac{1}{n^2}$. Showing that $S$ is finite &amp;ndash; i.e., that a finite sum exists &amp;ndash; simply requires observing that $n$&amp;rsquo;s exponent in the denominator (2) is greater than 1. Finding the sum, however, requires a more elaborate argument which we won&amp;rsquo;t spend time on here. Spoiler alert: $S = \pi^2 / 6$! What the irrational, transcendental $\pi$ has to do with this very integer-y sum is not obvious, but it&amp;rsquo;s another story for &lt;a href="https://www.jgindi.me/posts/euler-basel/">another post&lt;/a>.&lt;/p>
&lt;p>Establishing the fact that $S$ is finite just required a simple test. Producing the actual sum, however, took almost 85 years! This is weird! Next, we&amp;rsquo;ll look at an example from number theory where the difficulty of construction is the foundation upon which cybersecurity rests.&lt;/p>
&lt;h1 id="primality-testing-vs-integer-factorization">Primality Testing vs. Integer Factorization&lt;/h1>
&lt;p>One of the most deeply studied parts of number theory &amp;ndash; the study of properties of the positive whole numbers &amp;ndash; is the prime numbers. A number $p$ is prime if its only divisors are $p$ and $1$. So how do we tell if a number is prime?&lt;/p>
&lt;p>The simplest algorithm is to enumerate the numbers between $2$ and $p - 1$. If one of them divides $p$ evenly, the number is not prime. A slightly better way to do it is based on the fact that if $p = ab$, then unless $p$ is a perfect square, either $a &lt; \sqrt{p}$ and $b > \sqrt{p}$ or vice versa. Thus, instead of searching all the way up to $p$, we can search up to $\sqrt{p}$. But we&amp;rsquo;re not done! Another observation we can make is that since, at bottom, all numbers are products of primes, we only need to search the &lt;em>primes&lt;/em> up to $\sqrt{p}$.&lt;/p>
&lt;p>Much more (very sophisticated) work has been done improving these methods (and some of these improvements may at some point be the subject of another post), but the important thing to say here is that we now have algorithms that run relatively quickly for establishing the primality of a natural number. These efficient methods are often unlike the methods I mentioned earlier, since they do not rely on &amp;ldquo;trial divisions&amp;rdquo; (where we check different candidate divisors by dividing $p$ by them).&lt;/p>
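&lt;p>As a reference point, here is the simple $\sqrt{p}$ trial-division test in Python &amp;ndash; the slow baseline that those sophisticated modern algorithms improve on:&lt;/p>

```python
def is_prime(p):
    """Trial division up to sqrt(p): if p = a*b, one factor is <= sqrt(p)."""
    if p < 2:
        return False
    d = 2
    while d * d <= p:     # only candidates up to sqrt(p) are needed
        if p % d == 0:    # found a divisor, so p is composite
            return False
        d += 1
    return True

assert [n for n in range(2, 20) if is_prime(n)] == [2, 3, 5, 7, 11, 13, 17, 19]
```

&lt;p>Note that when this test answers &amp;ldquo;composite,&amp;rdquo; it happens to exhibit a factor; the fast modern primality tests give the same yes/no answer without producing any factor at all, which is exactly the existence-versus-construction gap at work.&lt;/p>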
&lt;p>Suppose now that instead of deciding if a number is prime, we change the problem statement slightly to: Find the prime factors of $p$. While it seems like being able to efficiently answer the &amp;ldquo;Is it prime?&amp;rdquo; question should require finding the factors along the way, once again, this turns out not to be the case. As of now, there are no published &amp;ldquo;efficient&amp;rdquo; classical integer factorization algorithms, though there are algorithms that are &amp;ldquo;almost efficient&amp;rdquo;. (Here, efficiency is a measure of how long it takes to factor an integer $n$ as a function of $n$&amp;rsquo;s size.)&lt;/p>
&lt;p>Many of the cryptographic protocols responsible for securing data on the internet rely on the (likely) computational hardness of factoring numbers efficiently. If an efficient algorithm were to be found, the internet could become a far less safe place to entrust with our credit card information.&lt;/p>
&lt;h1 id="the-probabilistic-method">The Probabilistic Method&lt;/h1>
&lt;p>The probabilistic method, often used when studying finite structures, is a very interesting and general technique that manages to cleverly prove that an object with certain (rare) properties exists without actually finding one. The method does this by using randomness in the following ingenious way. First, we construct a &amp;ldquo;random&amp;rdquo; instance of the object. Then, we show that the object has the desired property with nonzero probability. Since, if we sample an object at random, the &amp;ldquo;probability&amp;rdquo; of drawing the object with the properties we care about is
&lt;/p>
$$
\frac{\text{\# of configurations with property}}{\text{\# of configurations}},
$$&lt;p>
if the probability is nonzero, it means there must be at least one instance that has our property. Without using the method to actually carry out a proof here, I&amp;rsquo;ll set up a problem to give a &amp;ldquo;concrete&amp;rdquo; sense of what the method can solve.&lt;/p>
&lt;p>A graph is a collection of nodes connected to one another by edges. A &lt;em>complete&lt;/em> graph is a graph where every node is connected to all of the other nodes. If such a graph has $n$ nodes, then it has $n(n - 1) / 2$ edges. Examples of complete graphs on $n = 2, ..., 7$ vertices are shown below. We typically refer to the complete graph on $n$ vertices as $K_n$.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/exist-vs-construct/complete.png" alt="drawing" width="550" height="400"/>
&lt;p>The question we care about here is: Given a complete graph with $n$ vertices ($K_n$) and an integer $r$ (where $r &lt; n$), is it possible to color each edge either red or blue in such a way that no group of $r$ vertices ($K_r$) has monochromatic connecting edges?&lt;/p>
&lt;p>(Read that again if you need to.)&lt;/p>
&lt;p>The proof begins by constructing a &amp;ldquo;random&amp;rdquo; graph where each edge is colored red or blue independently at random. Using the fact that the graph is random (in the edge-coloring sense) and skipping over some details, it turns out that if for a particular choice of $n$ and $r$,
&lt;/p>
$$
\frac{n!}{r!(n-r)!} 2^{1 - r(r - 1) / 2} &lt; 1,
$$&lt;p>
then such an edge-coloring exists. In other words, given our initial question with some specific values of $n$ and $r$, you can verify that such an edge-coloring exists just by plugging those values into the left side of the inequality and seeing whether the result is smaller than 1. (If it is not, the test is simply inconclusive; the condition is sufficient, not necessary.) Note, however, that this plug-and-chug way of answering the question gives us &lt;em>no information&lt;/em> about &lt;em>how&lt;/em> to color the edges to see the actual coloring!&lt;/p>
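&lt;p>The plug-and-chug check is easy to carry out. In this sketch, the example values of $n$ and $r$ are my own, chosen to show both outcomes:&lt;/p>

```python
from math import comb

def bound_guarantees_coloring(n, r):
    """Probabilistic-method bound from the text: if C(n, r) * 2^(1 - r(r-1)/2)
    falls below 1, a red/blue coloring of K_n with no monochromatic K_r must
    exist. (False only means the test is inconclusive, not that none exists.)"""
    bound = comb(n, r) * 2 ** (1 - r * (r - 1) // 2)
    return 1 > bound

print(bound_guarantees_coloring(32, 10), bound_guarantees_coloring(1000, 5))  # True False
```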
&lt;p>Here, again, we have some clever way to answer the existence question while still being at square zero in terms of how we might go about construction.&lt;/p>
&lt;p>For this and other combinatorial problems, one can imagine sitting down and drawing some small examples to try to gain an intuition for what a valid coloring might look like. Maybe you draw a few examples for $n=5$ vertices (ten edges) or $n = 10$ vertices (45 edges). Then you pick a few values of $r$ and realize (if you&amp;rsquo;re lucky and persistent) that a few values of $r$ work while others don&amp;rsquo;t, and you start to feel like getting intuition for this problem from examples might be more difficult than you thought. The probabilistic method allows us to sidestep the problem of construction and instead prove existence in a way that highlights the surprising power of abstract reasoning in tackling what are otherwise mind-crushingly complex combinatorial problems.&lt;/p></description></item><item><title>Self-supervised learning</title><link>https://www.jgindi.me/posts/2024-07-17-self-supervision/</link><pubDate>Wed, 17 Jul 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-07-17-self-supervision/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Many of the most powerful models you see out there today are built using a two-stage approach:&lt;/p>
&lt;ol>
&lt;li>Pre-training: A model learns generally useful representations (e.g., of language or images) without reference to any particular task (e.g., sentiment analysis or image classification).&lt;/li>
&lt;li>Fine-tuning: Using the pre-trained model as a starting point, we fine-tune the model to accomplish some additional objective, such as safety or the ability to perform some downstream task.&lt;/li>
&lt;/ol>
&lt;p>In theory, the benefit of this approach is that once you have a powerful pre-trained model, you can build many different task-specific and/or fine-tuned models on top of it with relatively little additional effort. But this benefit is only realized if our pre-trained model has comprehensively discovered useful features of language, images, or both, which requires TONS of diverse, carefully curated data. From a machine learning perspective, this presents some thorny questions:&lt;/p>
&lt;ol>
&lt;li>It&amp;rsquo;s one thing to train a model for something specific, like classifying X-ray images of tissue as cancerous or not, but how do we set up a training objective that helps a model learn&amp;hellip; well&amp;hellip; very general and woefully underspecified &amp;ldquo;useful features of languages, images, or both&amp;rdquo;?&lt;/li>
&lt;li>The most widely used and successful traditional machine learning approaches have in the past required data that has been labeled, or annotated with correct outputs. But if we want to unlock the use of larger and larger datasets for training, obtaining sufficient high-quality labels can be prohibitively expensive, if it&amp;rsquo;s possible at all. Can we accomplish our pre-training without hand-labeled data?&lt;/li>
&lt;li>Even assuming we could get whatever quantity of labeled data we needed, what kinds of labels would be useful in gaining a general understanding of language or visual data?&lt;/li>
&lt;/ol>
&lt;p>One paradigm that has recently come to the fore is called self-supervision, wherein instead of hand-labeling each example, we actually generate the training signal from the raw data itself. To understand the significance of self-supervision, let&amp;rsquo;s first revisit the foundations of fully supervised learning. This will set the stage for appreciating how self-supervised approaches build upon and extend these principles to overcome the challenges of labeled data scarcity.&lt;/p>
&lt;h1 id="supervised-learning">Supervised learning&lt;/h1>
&lt;p>From thirty-five thousand feet, supervised learning works like this: We have a dataset of $(x, y)$ pairs, where the $x$s are called examples, and the $y$s are called labels. For instance, $x$ might be an image and $y$ might be a binary label that is 1 if the image contains a cat and 0 otherwise. In a supervised setting, during training, the model makes a cat/non-cat prediction for an image $x$, and then that prediction is checked against the label $y$. The closer the prediction &amp;ndash; which is usually a probabilistic score rather than binary &amp;ndash; is to the label, the less our parameters need to move in response to this $(x, y)$ pair.&lt;/p>
&lt;p>The key here &amp;ndash; and the problem &amp;ndash; is that for the above to work, we need a $y$ for every $x$! As noted earlier, depending on the problem, labels can be very expensive to generate. For object recognition tasks, for example, we need every object of interest to have been outlined and labeled in every image&amp;hellip; across hundreds of thousands or millions of images and across hundreds or thousands of different object categories! At scale, this is both time consuming and hard to do correctly.&lt;/p>
&lt;p>To make the jump over to self-supervision, we need to come up with useful objectives or tasks that models could use during training that only require $x$s. As a first example of how we can generate a training signal from unlabeled data, let&amp;rsquo;s talk about autoencoders.&lt;/p>
&lt;h1 id="autoencoders">Autoencoders&lt;/h1>
&lt;p>An autoencoder&amp;rsquo;s job is to take some big input thing and make it smaller. That is, we use autoencoders to take some complex object and turn it into a (relatively) low-dimensional vector of numbers (an embedding) that contains most of its relevant semantic content and is neural-network-compatible. Assuming the representation captures important information, these representations can then be used for downstream tasks like finding farms in satellite photos, or finding duplicate photos in a photo library.&lt;/p>
&lt;p>We do this compression in the following way. Let&amp;rsquo;s say you want to compress an object $x$. The autoencoder has two components:&lt;/p>
&lt;ol>
&lt;li>An encoder $E$: maps from our high-dimensional input space into our embedding space.&lt;/li>
&lt;li>A decoder $D$: this piece maps from the embedding space back into the input space.&lt;/li>
&lt;/ol>
&lt;p>(Note: Both $E$ and $D$ are smaller neural networks whose parameters are updated during training. Once the system has been trained, we typically throw out $D$ and just use $E$ to generate embeddings.)&lt;/p>
&lt;p>During training, the model is shown many examples $x$. For each one, it computes $e = E(x)$, and then $x' = D(e)$. Here, $e$ is what we call the &lt;em>embedding&lt;/em> of the object. With autoencoders, we use the original $x$ as our &amp;ldquo;label&amp;rdquo;! The way we determine whether or not our embedding $e$ is good is by whether it contains enough information to be re-expanded back into $x$! Mathematically, our representation is good if $\text{distance}(x, x')$ is small.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/self-supervision/autoencoder.png" alt="drawing" width="550" height="300"/>
&lt;p>This is our first example of actually manufacturing a supervised problem from unlabeled data by cleverly defining our objective. In this case, we define our training objective to be that our reproductions of inputs (images, for instance) should be as close as possible to the original inputs themselves. By using this training task, we obviate the need for labels, but are nonetheless able to learn something useful.&lt;/p>
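&lt;p>As a toy sketch of this recipe &amp;ndash; single linear maps stand in for the encoder and decoder networks, and all dimensions and data are made up for illustration:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 8-dim inputs compressed to 2-dim embeddings.
d_in, d_emb = 8, 2
W_enc = rng.normal(size=(d_in, d_emb))  # encoder E (a single linear map here)
W_dec = rng.normal(size=(d_emb, d_in))  # decoder D

def E(x):
    return x @ W_enc   # embedding e = E(x)

def D(e):
    return e @ W_dec   # reconstruction x' = D(e)

x = rng.normal(size=(4, d_in))  # a small batch of example inputs
e = E(x)
x_rec = D(e)

# Training would adjust W_enc and W_dec to shrink this reconstruction distance.
loss = np.mean((x - x_rec) ** 2)
print(e.shape, float(loss))
```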
&lt;p>Next, we&amp;rsquo;ll look at a pair of training tasks used in conjunction to train a model called BERT, a precursor to many of the large language models available today.&lt;/p>
&lt;h1 id="masked-language-modeling-mlm-and-next-sentence-prediction-nsp">Masked language modeling (MLM) and next sentence prediction (NSP)&lt;/h1>
&lt;p>In 2018, Google released a now-famous paper detailing a transformer-based language model called &lt;a href="https://arxiv.org/pdf/1810.04805">BERT&lt;/a>. What is interesting about it for our purposes is the authors&amp;rsquo; choice of training tasks. Here, again, as we&amp;rsquo;ll discuss, they found ways to generate a helpful training signal from the raw data itself, which in this case is text data. We will discuss each of their two tasks in turn: masked language modeling and next sentence prediction.&lt;/p>
&lt;h2 id="masked-language-modeling-mlm">Masked language modeling (MLM)&lt;/h2>
&lt;p>For this task, given a piece of text like&lt;/p>
&lt;blockquote>
&lt;p>The quick brown fox jumped over the lazy dog.&lt;/p>&lt;/blockquote>
&lt;p>we randomly mask a small fraction of the tokens to obtain something like&lt;/p>
&lt;blockquote>
&lt;p>The quick &lt;strong>[MASK]&lt;/strong> fox jumped over the &lt;strong>[MASK]&lt;/strong> dog.&lt;/p>&lt;/blockquote>
&lt;p>After doing its best to take the surrounding context into account, the model makes a prediction about which tokens the [MASK] tokens obscure. To be a little bit more precise, a prediction here is not just a single token like &lt;code>monkey&lt;/code> or &lt;code>brown&lt;/code>; it is actually a probability distribution over all possible tokens. In other words, for the first [MASK] token, the model might output something like&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Word&lt;/th>
&lt;th>Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>brown&lt;/td>
&lt;td>0.07&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>monkey&lt;/td>
&lt;td>0.002&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>excavator&lt;/td>
&lt;td>0.05&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&amp;hellip;&lt;/td>
&lt;td>&amp;hellip;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>(In reality, the vocabulary size over which the distribution is constructed is much larger than 3; it is often on the order of 10s or 100s of thousands.)&lt;/p>
&lt;p>To quantify how well we&amp;rsquo;re doing at any point during training, we look at the distance between the output distribution (like the one in the table) to the &amp;ldquo;correct&amp;rdquo; distribution, where the token that was actually masked is assigned probability 1 and all other tokens are assigned probability 0.&lt;/p>
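&lt;p>The distance typically used here is cross-entropy, which with a one-hot &amp;ldquo;correct&amp;rdquo; distribution reduces to the negative log-probability assigned to the masked token. A sketch using the illustrative probabilities from the table (not real model outputs):&lt;/p>

```python
import math

# Truncated predicted distribution for the first [MASK] position.
predicted = {"brown": 0.07, "monkey": 0.002, "excavator": 0.05}
target_token = "brown"  # the token that was actually masked

# With a one-hot target, cross-entropy is -log p(correct token).
loss = -math.log(predicted[target_token])
print(round(loss, 3))  # 2.659; shrinks toward 0 as p("brown") approaches 1
```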
&lt;p>As training progresses, with enough diverse data, the model will gradually produce spikier distributions, i.e., distributions where the probability of the right answer gets closer and closer to 1, and all other options tend to 0. Further along in training, the distribution I wrote just above might be something like:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Word&lt;/th>
&lt;th>Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>brown&lt;/td>
&lt;td>0.7&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>monkey&lt;/td>
&lt;td>0.0001&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>excavator&lt;/td>
&lt;td>0.025&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&amp;hellip;&lt;/td>
&lt;td>&amp;hellip;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The BERT authors use this as one of a pair of tasks that help the model learn nontrivial statistical properties of the unlabeled training text. While masked language modeling helps the model grasp contextual relationships within a sentence, understanding the relationship between different sentences is equally crucial. This brings us to the next task used in BERT&amp;rsquo;s training: next sentence prediction.&lt;/p>
&lt;h2 id="next-sentence-prediction">Next sentence prediction&lt;/h2>
&lt;p>In addition to an ability to represent individual words well, the BERT authors also wanted the model&amp;rsquo;s representations to be suited to tasks that depend on an understanding of relationships &lt;strong>between&lt;/strong> two pieces of text, which is not captured by the MLM task. To mix in an emphasis on this capability, the authors added another task called next sentence prediction (NSP), which works as follows:&lt;/p>
&lt;ol>
&lt;li>Select a sentence A from the training corpus.&lt;/li>
&lt;li>Select a sentence B from the training corpus. With a probability of 0.5, B is the sentence immediately following A, and with a probability of 0.5, B is some other, randomly selected sentence from the corpus.&lt;/li>
&lt;li>If B comes after A, this (A, B) pair is labeled with &lt;code>IsNext&lt;/code>. If B is random, the pair is labeled &lt;code>NotNext&lt;/code>.&lt;/li>
&lt;/ol>
&lt;p>During training, the model tries to learn to predict the correct label for each pair.&lt;/p>
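&lt;p>The pair-construction recipe above can be sketched with a made-up toy corpus (a real pipeline would also ensure the &amp;ldquo;random&amp;rdquo; sentence is not accidentally the true next one):&lt;/p>

```python
import random

random.seed(0)
corpus = ["The sky was clear.", "Birds sang loudly.",
          "Traffic was heavy.", "Rain began to fall."]

def make_nsp_pair(i):
    """Build one (A, B, label) next-sentence-prediction pair from sentence i."""
    a = corpus[i]
    if random.choice([True, False]):       # coin flip with probability 0.5
        b = corpus[i + 1]                  # the sentence that truly follows A
        label = "IsNext"
    else:
        b = random.choice(corpus)          # some other sentence from the corpus
        label = "NotNext"
    return a, b, label

pairs = [make_nsp_pair(i) for i in range(len(corpus) - 1)]
print(pairs[0])
```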
&lt;p>(Note for the slightly more technically inclined reader: If you think about how we would actually do this binary classification, it&amp;rsquo;s not so straightforward. What we actually do is we set aside a special token, usually designated [CLS]. Each transformer layer mixes together the representations of the tokens output by the previous layer, so after the input has made its way through those layers, this [CLS] token has contextual information from the actual sentence tokens mixed into its final representation. The embedding of this [CLS] token is then fed to a binary classifier.)&lt;/p>
&lt;p>During BERT&amp;rsquo;s training, the authors actually use both MLM and NSP and combine the error signals from each of them to update the model parameters. One interesting thing to note here is that these tasks don&amp;rsquo;t target any concrete feature of language, such as telling the difference between nouns and verbs, or correctly predicting correct verb tenses. Rather, we specify fuzzy concepts like being able to guess words from context and being able to tell when one sentence immediately follows another, and trust that the model will have to learn important language features in order to do those things well.&lt;/p>
&lt;p>With both MLM and NSP, we&amp;rsquo;ve seen how self-supervision can uncover rich linguistic features. The versatility of self-supervision is also evident in the domain of computer vision, where contrastive learning techniques have proven highly effective. Let&amp;rsquo;s delve into one common contrastive learning training objective and its applications in generating powerful image embeddings.&lt;/p>
&lt;h1 id="contrastive-learning">Contrastive learning&lt;/h1>
&lt;p>Suppose that we want to train a model $M$ to produce generic image embeddings. One way we can do this is as follows. First, we produce a triplet of embeddings:&lt;/p>
&lt;ol>
&lt;li>Select an image $x_1$ and compute its embedding $e_1 = M(x_1)$ .&lt;/li>
&lt;li>Apply an augmentation to $x_1$ and compute the embedding of the augmented image $e_1' = M(x_1')$. (There are a wide variety of augmentations we can apply, including crops, rotations, or color inversions.)&lt;/li>
&lt;li>Choose another random image $x_2$ from the dataset and compute its embedding $e_2 = M(x_2)$.&lt;/li>
&lt;/ol>
&lt;p>Contrastive objectives try to push embeddings of similar objects (e.g., an image and its crop) closer together, while pushing embeddings of unrelated objects (e.g., two random images) apart. We can set up an objective to do this using our three embeddings as follows:&lt;/p>
$$
\mathcal L(e_1, e_1', e_2; M) = \max(\text{distance}(e_1, e_1') - \text{distance}(e_1, e_2) + \alpha, 0)
$$&lt;p>Let&amp;rsquo;s have a think about what&amp;rsquo;s going on here. The function $\mathcal L$ takes positive values when
&lt;/p>
$$
\text{distance}(e_1, e_1') - \text{distance}(e_1, e_2) > \alpha.
$$&lt;p>That is, when the distance between embeddings of similar objects is larger than the distance between embeddings of unrelated objects by more than $\alpha$ (we choose $\alpha$), the model needs to be adjusted. We would take that positive loss and make updates to $M$&amp;rsquo;s parameters in proportion to how much each contributed to the loss.&lt;/p>
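&lt;p>A minimal numerical sketch of this loss, with hand-picked toy embeddings standing in for the model&amp;rsquo;s outputs:&lt;/p>

```python
import numpy as np

def triplet_loss(e1, e1_aug, e2, alpha=0.2):
    """Contrastive (triplet) loss from the text: hinge on the distance gap."""
    d_pos = np.linalg.norm(e1 - e1_aug)  # distance(e1, e1')
    d_neg = np.linalg.norm(e1 - e2)      # distance(e1, e2)
    return max(d_pos - d_neg + alpha, 0.0)

# Toy embeddings: the augmented view sits close to e1, the random image far away.
e1 = np.array([1.0, 0.0])
e1_aug = np.array([0.9, 0.1])
e2 = np.array([-1.0, 0.5])
print(triplet_loss(e1, e1_aug, e2))  # 0.0: the margin is already satisfied
```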
&lt;p>With a large enough dataset of diverse images and enough training, this leads to image embeddings that can be applied to a range of downstream applications and tasks.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>The shift from relying solely on traditional supervised learning to pre-training using self-supervision marks a significant evolution in machine learning, offering solutions to the challenges of scale imposed by the costs of obtaining high-quality labeled data. By creatively designing training tasks that derive supervision from the data itself, we can harness the power of large, unlabeled datasets to pre-train models that can then be fine-tuned for specific applications.&lt;/p>
&lt;p>From autoencoders and masked language modeling to next sentence prediction and contrastive learning, each method reveals both the ingenuity and surprising effectiveness of simple techniques in developing versatile and powerful models. This paradigm not only streamlines the development process but also paves the way for more adaptable and scalable AI systems.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>The connection between k-means and Gaussian mixtures</title><link>https://www.jgindi.me/posts/km-gmm/</link><pubDate>Thu, 04 Apr 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/km-gmm/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In the next few posts I want to take you on the learning journey that happened when I tried to wrap my head around &lt;a href="https://icml.cc/2012/papers/291.pdf">this wonderful paper&lt;/a> by Brian Kulis and Michael I. Jordan (yes, Michael Jordan). The paper is about a variant of the k-means algorithm inspired by a Bayesian way of thinking about clustering. This helps solve a significant problem with k-means, namely that we don&amp;rsquo;t know the optimal value of $k$ in advance, and, furthermore, that there often is not even a way to make a reasonable guess of what a good value might be for a particular dataset.&lt;/p>
&lt;p>In this post, we will go over the contents of section 2.1 of the paper, which is about a connection between k-means and a mixture of Gaussians model. In future posts, we will introduce a Bayesian approach to clustering, and see how that perspective is more than just philosophical; it can help us develop a new algorithm that is more flexible than k-means.&lt;/p>
&lt;p>If you&amp;rsquo;re not familiar with k-means, I&amp;rsquo;d recommend reading up on it before proceeding. The &lt;a href="https://en.wikipedia.org/wiki/K-means_clustering">Wikipedia page&lt;/a> is a good place to start.&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>We are going to develop the k-means algorithm in a somewhat non-standard way, but first we need to review a few preliminaries.&lt;/p>
&lt;h2 id="mixtures-of-gaussians">Mixtures of Gaussians&lt;/h2>
&lt;p>A mixture of Gaussians model can be succinctly expressed as follows:&lt;/p>
$$
\begin{align*}
p(x) = \sum_{i=1}^k \pi_i N(x | \mu_i, \Sigma_i).
\end{align*}
$$&lt;p>This expression says that under our model, the probability assigned to sampling a point near $x$ is a mixture of the probabilities of sampling a point near $x$ according to each of $k$ different Gaussians. The mixing coefficient $\pi_i$ represents the amount of weight we place on the $i$th Gaussian (the coefficients are nonnegative and sum to 1), and $\mu_i$ and $\Sigma_i$ are the means (vectors) and covariances (matrices), respectively.&lt;/p>
&lt;p>One way to interpret this model that I find intuitively appealing is as an assumption that each point in your dataset is generated from one of $k$ distinct (Gaussian) generating processes. To sample a new point from this distribution, you can first sample a value $i$ using the $\pi_i$, and then subsequently sample a point from the corresponding Gaussian. (Note that this assumption may or may not be suitable for a particular dataset or application. As George Box has famously said: &amp;ldquo;All models are wrong, but some are useful.&amp;rdquo;)&lt;/p>
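&lt;p>The two-stage sampling procedure just described can be sketched in a few lines; the weights, means, and variances below are arbitrary toy values:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D mixture of k = 2 Gaussians.
pi = np.array([0.3, 0.7])      # mixing coefficients (nonnegative, sum to 1)
mu = np.array([-2.0, 3.0])     # component means
sigma = np.array([0.5, 1.0])   # component standard deviations

def sample(n):
    # First sample a component index i using the pi, then sample
    # a point from the corresponding Gaussian N(mu_i, sigma_i^2).
    comps = rng.choice(len(pi), size=n, p=pi)
    return rng.normal(mu[comps], sigma[comps])

xs = sample(1000)
print(xs.mean())  # roughly pi @ mu = 1.5
```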
&lt;h2 id="em-algorithm">EM algorithm&lt;/h2>
&lt;p>The EM algorithm is a way of carrying out &lt;a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation&lt;/a> in cases where the model is constructed with &lt;strong>latent variables&lt;/strong>. Latent variables are often used to account for or represent characteristics of a population that cannot be directly observed. As an example, someone&amp;rsquo;s political orientation might not be directly observable, but if you ask questions about certain issues, you might be able to infer it.&lt;/p>
&lt;p>The algorithm tries to find an optimal pair of quantities: the values of the latent variables and the parameter values. To do this, we alternate between the following two steps.&lt;/p>
&lt;h3 id="expectation-step">Expectation step&lt;/h3>
&lt;p>The output of this step is a &lt;em>function&lt;/em> that takes a potential parameter setting (i.e., a model) $\theta$, and outputs a score. (Higher scores are better.) The obvious question: How do we compute that score? Let&amp;rsquo;s say that $z$ is a possible latent variable setting. The score corresponding to $z$ is $\log p(x, z | \theta)$, i.e., the (log) likelihood of observing the data $x$ and $z$ assuming the model $\theta$. Since a &amp;ldquo;probable&amp;rdquo; setting of $z$ depends on the parameters we&amp;rsquo;ve estimated so far and the observed data $x$, the overall score for $\theta$ takes a weighted average of the scores across possible values of $z$, where the weights are the likelihood of observing $z$ given the latest parameters and the data $x$.&lt;/p>
&lt;p>To summarize using notation we&amp;rsquo;ll refer to in the next section, the output of this step is a function we&amp;rsquo;ll call $Q$. To formalize the idea in the previous paragraph, we define $Q$ as follows:&lt;/p>
$$
\begin{align*}
Q(\theta; x, \theta^{(t)}) = E_{z \sim p(\cdot | x, \theta^{(t)})} [\log p(x, z | \theta)].
\end{align*}
$$&lt;p>Here, $\theta^{(t)}$ is our running parameter estimate; the $(t)$ is an iteration index, not an exponent. At each step, we have access to a concrete value of $\theta^{(t)}$. As the algorithm progresses, we update $\theta^{(t)}$ to $\theta^{(t+1)}$ and so on. In contrast to $\theta^{(t)}$, $\theta$ is a placeholder for a parameter setting &lt;em>that we are constructing $Q$ to evaluate&lt;/em>. In the same way that when we define $f(x)$, we don&amp;rsquo;t know anything about $x$ except that it&amp;rsquo;s an input to $f$, similarly, when we work with $Q(\theta; x, \theta^{(t)})$, we don&amp;rsquo;t know anything about $\theta$ except that it&amp;rsquo;s a potential parameter setting we might want to test.&lt;/p>
&lt;p>The semicolon in the definition of $Q$ is just a way of indicating that the function also depends on the data $x$ and the latest parameter setting $\theta^{(t)}$, but that we&amp;rsquo;re treating these as known (whereas $\theta$ is a variable).&lt;/p>
&lt;h3 id="maximization-step">Maximization step&lt;/h3>
&lt;p>Once we have the function $Q$, we simply maximize the function over $\theta$. The concrete update rule depends on the model we&amp;rsquo;re working with, but the general idea is to find the parameter setting that makes the data most likely. In general mathematical terms we are looking for&lt;/p>
$$
\begin{align*}
\theta^{(t+1)} = \arg\max_{\theta} Q(\theta; x, \theta^{(t)}).
\end{align*}
$$&lt;p>We alternate between the expectation step (forming $Q$ using $\theta^{(t)}$) and maximization step (finding the parameter value that maximizes $Q$) until the algorithm converges.&lt;/p>
&lt;h1 id="k-means--em--mixture-of-gaussians--a-limit">K-means = EM + mixture of Gaussians + a limit&lt;/h1>
&lt;p>For each point $x_i$, we can imagine that there is a true, but latent, cluster assignment vector $z_i \in \text{one-hot}(k)$. Each entry $z_{ic}$ of $z_i$ is 1 if $x_i$ is a member of cluster $c$ and 0 otherwise. (Note that only one component of $z_i$ can be 1 for each $i$.) In our setup, since the cluster assignment is latent, $x_i$&amp;rsquo;s cluster membership is smeared across the different clusters probabilistically. We express this idea of a &lt;strong>soft assignment&lt;/strong> by saying that there is a probability, $\gamma(z_{ic})$, that the point $x_i$ is in cluster $c$. (In contrast, k-means makes a &lt;strong>hard assignment&lt;/strong>, where each point is assigned to &lt;em>exactly one&lt;/em> cluster, in which case $\gamma$&amp;rsquo;s output would be binary. We&amp;rsquo;ll get to that in a moment.)&lt;/p>
&lt;p>Let&amp;rsquo;s assume that the $x_i$ are drawn from a mixture of Gaussians with mixture coefficients $\pi_i$ for $i = 1,\dots, k$, but with the additional assumption that for each $i$, $\Sigma_i = \sigma I$. That is, assume that the covariance matrices for each Gaussian are all diagonal and have the same value $\sigma$ on their diagonals.&lt;/p>
&lt;p>The entries of the $\gamma(z_i)$ (the vector of $\gamma(z_{ic})$ for $c = 1, \dots, k$) can be expressed as&lt;/p>
$$
\begin{align*}
\gamma(z_{ij}; \theta^{(t)}) &amp;= \frac{\pi_j N(x_i | \mu_j, \sigma I)}{\sum_{l=1}^k \pi_l N(x_i | \mu_l, \sigma I)} \\
&amp;= \frac{\pi_j \exp\left(-\frac{1}{2\sigma} \|x_i - \mu_j\|^2\right)}{\sum_{l=1}^k \pi_l \exp\left(-\frac{1}{2\sigma} \|x_i - \mu_l\|^2\right)}.
\end{align*}
$$&lt;p>We write $\gamma(z_{ij}; \theta^{(t)})$ to emphasize that the $\gamma$&amp;rsquo;s are computed using the latest parameter estimates $\theta^{(t)}$.&lt;/p>
&lt;p>The denominator here looks complicated, but it&amp;rsquo;s just there to make sure that the entries of $\gamma(z_i)$ sum to 1. The numerator is roughly the probability that $x_i$ was generated by the $j$th Gaussian. (The exact expression is a bit more complicated, but this is the intuition.)&lt;/p>
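&lt;p>This computation can be sketched directly. The data, means, weights, and $\sigma$ below are arbitrary illustrative values, with the equal-diagonal covariances $\sigma I$ assumed above:&lt;/p>

```python
import numpy as np

def responsibilities(X, mus, pis, sigma):
    """Soft assignments gamma(z_ij) for a mixture with covariances sigma * I.
    The normalization constants of the Gaussians cancel in the ratio."""
    # Squared distances ||x_i - mu_j||^2, shape (n, k).
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    w = pis * np.exp(-sq / (2 * sigma))      # unnormalized numerators
    return w / w.sum(axis=1, keepdims=True)  # each row sums to 1

X = np.array([[0.0, 0.0], [5.0, 5.0]])       # two toy points
mus = np.array([[0.0, 0.0], [5.0, 5.0]])     # two toy means
gam = responsibilities(X, mus, np.array([0.5, 0.5]), sigma=1.0)
print(gam.round(3))  # each point puts nearly all its weight on its own mean
```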
&lt;p>The k-means algorithm can be seen as a limiting case of the EM algorithm applied to this mixture of Gaussians setup.&lt;/p>
&lt;h2 id="the-e-step">The E-step&lt;/h2>
&lt;p>The TL;DR is that the E-step of the EM algorithm computes the $\gamma(z_i)$ for each $x_i$ using the existing $\mu_j$s and $\pi_j$s. I want to be slightly more precise, though, about how this fits within the abstract framework of the EM algorithm that we outlined earlier.&lt;/p>
&lt;p>The E-step computes $\gamma(z_i)$ as above. To map this onto the abstract definition of the E-step, we can first note that for a particular example $x_i$, we can compute the (log) likelihood of $x_i$ being generated by the $j$th Gaussian as&lt;/p>
$$
\log p(x_i, z_{ij} = 1 | \theta) = \log \pi_j N(x_i | \mu_j, \sigma I).
$$&lt;p>Here, $\theta$ is the set of all parameters to evaluate, i.e., the $\pi_i$ and $\mu_i$. To compute the expected value of this quantity for $x_i$ over all possible latent variable settings, we can use the $\gamma(z_{ij})$ we computed above to weight the contribution of the log probability for each latent setting $z$:&lt;/p>
$$
\begin{align*}
Q(\theta; x_i, \theta^{(t)}) &amp;= E_{z \sim p(\cdot | x_i, \theta^{(t)})} [\log p(x_i, z | \theta)] \\
&amp;= \sum_{j=1}^k \gamma(z_{ij}) \log \pi_j N(x_i | \mu_j, \sigma I).
\end{align*}
$$&lt;p>To extend this expectation to the entire dataset, we can simply sum over all the $x_i$:&lt;/p>
$$
Q(\theta; x, \theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^k \gamma(z_{ij}; \theta^{(t)}) \log \pi_j N(x_i | \mu_j, \sigma I).
$$&lt;p>Note that in the equation above, the $\mu_j$ and $\pi_j$ are the parameters that we are evaluating with the function $Q$, whereas the latest means and mixture coefficients we&amp;rsquo;ve estimated so far (i.e., those represented by $\theta^{(t)}$) are used to compute the $\gamma(z_{ij})$.&lt;/p>
&lt;h2 id="the-m-step">The M-step&lt;/h2>
&lt;p>Since we&amp;rsquo;ve fixed the covariances &amp;ndash; they are not parameters that need to be estimated &amp;ndash; the M-step updates the $\mu_j$ to be the weighted average of all of the $x_i$. Each point&amp;rsquo;s contribution to the computation of $\mu_j$ is weighted by the corresponding $\gamma(z_{ij})$, or how strongly it is attracted to $\mu_j$. This makes intuitive sense; the farther away a point is from a given mean, the (exponentially) less impact it has on that mean&amp;rsquo;s update.&lt;/p>
&lt;h2 id="putting-it-all-together">Putting it all together&lt;/h2>
&lt;p>The final ingredient we need to make the rigorous connection to k-means is to note what happens when $\sigma$ gets smaller and smaller. As the $k$ Gaussians in our mixture become narrower and narrower, points that are not very close to the means have their likelihoods exponentially decay to 0. Since the sum in the denominator of the expression for $\gamma(z_{ij})$ becomes dominated by the term corresponding to the mean closest to $x_i$, as $\sigma \to 0$, $\gamma(z_{ij})$ tends to 1 when $j$ minimizes $||x_i - \mu_j||^2$ and to 0 otherwise. This &amp;ldquo;one-hot&amp;rdquo;ing of the $\gamma(z_i)$ is the same as saying that letting $\sigma$ go to 0 turns our soft assignment into a hard assignment, which is exactly what k-means is designed to do.&lt;/p>
&lt;p>To summarize: The k-means algorithm can be seen as the application of the EM algorithm to a mixture of Gaussians when we assume that the covariance matrices are fixed to $\sigma I$ and we let $\sigma$ go to 0.&lt;/p>
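&lt;p>We can watch this collapse happen numerically. In the sketch below (equal mixture weights, made-up names), shrinking $\sigma$ drives a point&amp;rsquo;s responsibilities from a nearly even split to a one-hot assignment on the nearest mean:&lt;/p>

```python
import numpy as np

def gamma_row(x, mus, sigma):
    """Responsibilities of a single point under equal-weight
    spherical Gaussians with covariance sigma*I."""
    d2 = ((mus - x) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * sigma))  # subtract the min for stability
    return w / w.sum()

x = np.array([1.0, 0.0])
mus = np.array([[0.0, 0.0], [3.0, 0.0]])
for sigma in [10.0, 1.0, 0.01]:
    print(sigma, gamma_row(x, mus, sigma))
# as sigma -> 0 the soft assignment collapses onto the nearest mean
```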
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>That&amp;rsquo;s all for now. I&amp;rsquo;ve always found it quite satisfying to see how different, seemingly disparate theorems, algorithms, models, etc. can come together in unexpected ways.&lt;/p>
&lt;p>In the next post, we&amp;rsquo;ll start to develop our Bayesian perspective to develop the new clustering algorithm I mentioned in the intro. Stay tuned!&lt;/p></description></item><item><title>Simulated annealing</title><link>https://www.jgindi.me/posts/2024-03-08-sim-ann/</link><pubDate>Fri, 08 Mar 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-03-08-sim-ann/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Optimization problems are everywhere!&lt;/p>
&lt;p>Whether it&amp;rsquo;s finding the most efficient way to deliver packages to customers,
determining the best next move in a game of chess, or figuring out how to adjust the parameters of a
gigantic machine learning model, many important practical problems are, at their cores, optimization problems.
In this post, we will learn an optimization meta-algorithm called &lt;strong>simulated annealing&lt;/strong>, a
general approach to (approximately) finding global solutions to optimization problems&amp;hellip; which is,
interestingly, inspired by a physical process from material science.&lt;/p>
&lt;h1 id="simulated-annealing-overview">Simulated annealing: overview&lt;/h1>
&lt;h2 id="annealing">Annealing&lt;/h2>
&lt;p>Before discussing its algorithmic analog, we should sketch out what annealing is and how it works.
Annealing is a process that alters the physical and chemical properties of a metal so that it can be worked
more easily. It begins by heating the metal above its recrystallization point so that its internal
structure can change more freely. We then slowly cool the metal to allow it
to settle into a chemically superior state.&lt;/p>
&lt;h2 id="relationship-to-optimization">Relationship to optimization&lt;/h2>
&lt;p>One way to solve an optimization problem is to carry out the following iterative process:&lt;/p>
&lt;ol>
&lt;li>Start in some state (e.g., a chessboard just after it has been set up).&lt;/li>
&lt;li>Transition from the current state to the new state that most decreases the value of
an objective function you want to minimize (e.g., make a move that decreases your probability of losing).&lt;/li>
&lt;li>Repeat step 2 until a stopping condition is met (e.g., the game ends).&lt;/li>
&lt;/ol>
&lt;p>Simulated annealing modifies this process. It would instead look something like this:&lt;/p>
&lt;ol>
&lt;li>Start in an initial state.&lt;/li>
&lt;li>Sample a random candidate state to transition to.&lt;/li>
&lt;li>With some probability that depends on the current and candidate state, accept the candidate
transition. Otherwise, stay put.&lt;/li>
&lt;li>Repeat steps 2 and 3 until a stopping condition is met.&lt;/li>
&lt;/ol>
&lt;p>The analogy to physical annealing comes from the fact that step 3 depends on a parameter called the temperature.
In physics, the higher the temperature of a system, the more jittery the system is &amp;ndash; that is, the
more random motion there is among its constituent particles. Early in the optimization process, we set a
high temperature; this allows the algorithm to explore by accepting riskier transitions, i.e., those that result in a
higher (worse) objective value than that of our current state. As the annealing progresses, we lower the temperature;
this causes the optimization process to become more conservative. Eventually, the space of acceptable
next states will contain only those that are better than the current state (in terms of objective value).&lt;/p>
&lt;h2 id="global-vs-local-optimization">Global vs local optimization&lt;/h2>
&lt;p>One question you might ask is: Why bother with the high temperature phase at all? If low temperatures
will allow the algorithm to only move toward better solutions, why not always make those kinds of moves?
The key to answering this question lies in the difference between globally and locally optimal solutions
to a problem. Suppose you are on a quest to see the view from the highest point in San Francisco
(which has lots of hills). If you only ever step in the direction of steepest ascent from where you are,
you will reach the top of &lt;em>some&lt;/em> hill, but it&amp;rsquo;s possible that in order to reach the top of the &lt;em>highest&lt;/em> hill,
you should have walked downhill for a while in another direction first and only then started to ascend.
The hill whose acme you reached by steepest ascent &amp;ndash; a local optimum &amp;ndash; is not very difficult to find.
By contrast, finding the true tallest peak in San Francisco &amp;ndash; the global optimum &amp;ndash; is trickier.&lt;/p>
&lt;p>For many problems &amp;ndash; like finding the best set of parameters for a machine learning model &amp;ndash; local
optima work very well, and in many cases we have powerful algorithms for efficiently finding them.
Finding global optima, on the other hand, is far more challenging in general, and good algorithms
are scarcer, if they exist at all. Simulated annealing is a probabilistic strategy for searching for
&lt;em>global&lt;/em> optima by exploring aggressively enough early on to find the base of the right hill.&lt;/p>
&lt;p>In the remainder of this post, we will more explicitly discuss how we carry out steps 2 and 3, and show how
we might apply this meta-algorithm to the &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">traveling salesman problem&lt;/a>,
one of the most difficult discrete optimization problems we have.&lt;/p>
&lt;h1 id="acceptance-probability">Acceptance probability&lt;/h1>
&lt;p>The key detail that I want to make more precise is how we formulate the probability of accepting a
transition from one state to another. In terms of notation, $\mathcal S$ is the entire state
space, $s \in \mathcal S$ refers to the current state, $s' \in \mathcal S$ refers
to the candidate state, $E:\mathcal S \to \mathbb R$ is the objective function (lower is better) that we want
to minimize, and $T_k$ is the temperature parameter value on the $k$th step (a real number). Our objective
is to find the state $s^\star$ that minimizes $E$. In other words, we want to find
&lt;/p>
$$
s^\star := \underset{s \in \mathcal S}{\text{argmin}} ~ E(s).
$$&lt;p>
(The &amp;ldquo;$:=$&amp;rdquo; symbol means that the right hand side is the definition of $s^\star$, rather than some
equation to prove or solve.)&lt;/p>
&lt;p>For simplicity, define $e = E(s)$ and $e' = E(s')$. If we are currently in state $s$
and $e' &lt; e$, we automatically transition to $s'$. If $e' > e$, then we transition
with probability
&lt;/p>
$$
P_{\rm acc}(e, e'; T_k) = \exp(-(e' - e) / T_k).
$$&lt;p>
(Note: This is not a probability distribution over states. Instead, here, $P_{\rm acc}$ is used to make a decision about
whether to transition to a &lt;em>particular&lt;/em> successor state. For this purpose, after we compute
$p = P_{\rm acc}(e, e'; T_k)$, we can use a random number generator to generate a random
number $r$. If $r &lt; p$, we transition. To sample a state from the entire state space, we would need the
transition probabilities for each possible transition to sum to 1.)&lt;/p>
&lt;p>Let&amp;rsquo;s take a minute to think through why this acceptance probability works the way we want it to:&lt;/p>
&lt;ul>
&lt;li>Since we would have accepted automatically if $e' &lt; e$, we can assume that $e' - e > 0$. If
this difference is very large, the negative sign and the exponential around it make
$P_{\rm acc}$ very small. This means that the probability of accepting a transition
decays exponentially for less desirable candidates.&lt;/li>
&lt;li>Decreasing the value of $T_k$ (as $k$ increases) causes the exponent to become large and negative,
producing probabilities close to 0. This means that as we run more steps and decrease the temperature,
the same differences in objective value will become less and less acceptable. This aligns with the
intuition that as the temperature decreases, the optimization process becomes more conservative.&lt;/li>
&lt;/ul>
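&lt;p>In code, the acceptance rule is only a few lines. Here is a minimal Python sketch (the helper names are mine, not part of any library):&lt;/p>

```python
import math
import random

def p_acc(e, e_new, T):
    """Acceptance probability: always accept improvements, accept
    worsenings with probability exp(-(e_new - e) / T)."""
    if e_new < e:
        return 1.0
    return math.exp(-(e_new - e) / T)

def accept(e, e_new, T, rng=random.random):
    """Flip a biased coin against the acceptance probability."""
    return rng() < p_acc(e, e_new, T)

# a worsening of 1.0 is often accepted when hot, almost never when cold
print(p_acc(5.0, 6.0, T=10.0))   # exp(-0.1), roughly 0.905
print(p_acc(5.0, 6.0, T=0.1))    # exp(-10), roughly 4.5e-5
```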
&lt;h1 id="application-the-traveling-salesman-problem-tsp">Application: The Traveling Salesman Problem (TSP)&lt;/h1>
&lt;p>If you&amp;rsquo;ve never heard of the traveling salesman problem, check out &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">this wikipedia article&lt;/a>
before continuing. To summarize:&lt;/p>
&lt;ul>
&lt;li>There are $n$ cities to visit.&lt;/li>
&lt;li>There are roads connecting every pair of cities.&lt;/li>
&lt;li>Each road has a (nonnegative) toll associated with it.&lt;/li>
&lt;li>&lt;strong>Goal:&lt;/strong> Find the minimum cost path that ends where you start and visits each city exactly once.&lt;/li>
&lt;/ul>
&lt;p>The first thing to remember when using simulated annealing is that for most problems we would
apply it to, we should expect to not obtain the globally optimal solution at the end; instead, we
hope for a result that is just good enough. In the case of TSP, simulated annealing can give us a
reasonable approximation, but we cannot really guarantee anything more than that.&lt;/p>
&lt;p>Another practical consideration that arises is how to define the state space for the problem at hand.
Before continuing, think about how you might define it for TSP.&lt;/p>
&lt;p>A sensible way to define it is to consider any ordering of the $n$ cities to be a state. Defining states
this way, there are $n!$ states &amp;ndash; for $n \geq 20$, this number is absolutely massive. With such large state
spaces, one typically also has to narrow the space of transitions under consideration at each step. In the case of TSP,
we might do this by only allowing transitions that swap a pair of cities in the order. Can you think of
other methods?&lt;/p>
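&lt;p>For concreteness, here is one way we might implement the swap transition and the route cost in plain Python (the names and the tiny toll matrix are made up for illustration):&lt;/p>

```python
import random

def total_cost(order, dist):
    """Cost of visiting the cities in `order` and returning to the start."""
    n = len(order)
    return sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))

def randomly_swap_pair(order, rng=random):
    """Candidate transition: swap two cities in the current tour."""
    i, j = rng.sample(range(len(order)), 2)
    new_order = list(order)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order

# symmetric 4-city toll matrix, invented for the example
dist = [[0, 1, 9, 9],
        [1, 0, 1, 9],
        [9, 1, 0, 1],
        [9, 9, 1, 0]]
print(total_cost([0, 1, 2, 3], dist))  # 1 + 1 + 1 + 9 = 12
```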
&lt;p>Finally, we define our objective function $E$ to just be the total cost of a particular route, and we
stop when we&amp;rsquo;ve gone some number of iterations without making progress over the best solution we&amp;rsquo;ve obtained
so far. With this setup, we can implement our algorithm following the pseudo-python below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">simulated_annealing_TSP&lt;/span>(G, max_iters_with_no_improvement):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> G: the initial problem structure (tolls for each road)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> max_iters_with_no_improvement: The maximum number of iterations allowed without
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> surpassing the best seen so far before termination.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># initialize the temperature and pick an initial state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> T &lt;span style="color:#f92672">=&lt;/span> initialize_temperature()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> s &lt;span style="color:#f92672">=&lt;/span> pick_random_city_order(G)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter, best_state, lowest_so_far &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#66d9ef">None&lt;/span>, inf
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># while we&amp;#39;re making progress...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">while&lt;/span> no_improvement_counter &lt;span style="color:#f92672">&amp;lt;&lt;/span> max_iters_with_no_improvement:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># select a candidate next state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> candidate &lt;span style="color:#f92672">=&lt;/span> randomly_swap_pair(s) &lt;span style="color:#75715e"># using any restriction&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># compute the costs of the current state and the candidate&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> e_s, e_cand &lt;span style="color:#f92672">=&lt;/span> total_cost(s, G), total_cost(candidate, G)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># decide whether to accept by comparing a uniform random number&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># to the acceptance probability described earlier&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> uniform(&lt;span style="color:#ae81ff">0&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">&amp;lt;&lt;/span> p_acc(e_s, e_cand, T):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> s &lt;span style="color:#f92672">=&lt;/span> candidate
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># if we see a new best, reset the progress counter and&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># save the best state&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> e_cand &lt;span style="color:#f92672">&amp;lt;&lt;/span> lowest_so_far:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> best_state &lt;span style="color:#f92672">=&lt;/span> s
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> no_improvement_counter &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># reduce the temperature&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> T &lt;span style="color:#f92672">=&lt;/span> reduce_temperature(T)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># return the best state when the iteration completes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> best_state
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>There are some details and optimizations left out, but hopefully the code feels straightforward enough for
you to try to implement this on your own!&lt;/p>
&lt;p>(Note: One detail we left out in the above is the schedule to use to reduce $T_k$ over time. This is
a subtle problem, since if we lower it too quickly, our optimization process will not sufficiently explore,
whereas if we lower it too slowly, we may not make forward progress fast enough.)&lt;/p>
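&lt;p>For example, a geometric schedule &amp;ndash; multiply $T$ by a constant $\alpha$ slightly below 1 at every step &amp;ndash; is a common default; slower schedules explore more at the cost of more iterations. A quick sketch:&lt;/p>

```python
from itertools import islice

def geometric_schedule(T0, alpha):
    """Yield T0, alpha*T0, alpha^2*T0, ... for 0 < alpha < 1."""
    T = T0
    while True:
        yield T
        T *= alpha

temps = list(islice(geometric_schedule(100.0, 0.9), 5))
print(temps)  # 100.0, 90.0, 81.0, ...
```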
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>In this post, we briefly described a meta-algorithm called simulated annealing that can help approximate
global optima for properly formulated optimization problems, many of which are extremely computationally difficult.
It is often most useful when there are not other more direct, problem-specific algorithms we can bring to bear.
In addition to describing the general setup, we also looked at how we could apply this approach to the TSP,
which gave us a flavor for some of the practical considerations that arise when trying to fit a problem
into the SA framework.&lt;/p>
&lt;p>Simulated annealing is a powerful tool that is employed to solve a variety of thorny optimization problems
across the sciences. It&amp;rsquo;s a good tool to have in your toolkit &amp;ndash; I hope it comes in handy!&lt;/p></description></item><item><title>RANSAC for robust data fitting</title><link>https://www.jgindi.me/posts/2023-12-22-ransac/</link><pubDate>Fri, 22 Dec 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-12-22-ransac/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I want to introduce a non-standard way of fitting a mathematical model to data
that I came across during a course I took this past semester in computer vision. While gradient
descent rules the day (as it should!), the method we discuss here is actually pretty clever, and
its simplicity belies a significant benefit: robustness to outliers.&lt;/p>
&lt;h2 id="linear-models">Linear models&lt;/h2>
&lt;h3 id="what-are-they">What are they?&lt;/h3>
&lt;p>Much of machine learning and statistical modeling can be roughly characterized by the following sequence of steps:&lt;/p>
&lt;ol>
&lt;li>Gather data about some phenomenon or process, often in the form of (input, output) pairs.
In this setup, we hope that there is some meaningful relationship between the inputs and the outputs,
and the outputs are the phenomenon we want to learn to predict.&lt;/li>
&lt;li>Make some assumptions and try to find a useful mathematical model that explains the input/output relationship.&lt;/li>
&lt;li>Use the learned mathematical model to make predictions on new examples that were not in the original
training set of inputs and outputs.&lt;/li>
&lt;/ol>
&lt;p>(Mathematically, the model is a function $\hat f$ that maps inputs to outputs, hopefully
without too much error. I call the function $\hat f$ here because one way of thinking about
what we&amp;rsquo;re doing is that we&amp;rsquo;re trying to approximate some true function $f$ that relates the
inputs to the outputs.)&lt;/p>
&lt;p>There is a lot of nuance and subtlety to how we learn the model from data and how we verify
that it&amp;rsquo;s working well on unseen examples, but that&amp;rsquo;s the basic idea.&lt;/p>
&lt;p>One choice that we the modelers have to make before learning the model is the set of &amp;ldquo;shapes&amp;rdquo; it can take,
which encodes an assumption about the underlying input/output relationship. The simplest
possible model we can use is called a linear model, which assumes that the output
changes linearly (usually with some small amount of random variation) as the input changes.&lt;/p>
&lt;h3 id="how-do-we-learn-them">How do we learn them?&lt;/h3>
&lt;p>To make things more mathematically precise, let&amp;rsquo;s say our inputs $x_i$ are $d$-dimensional
vectors of real numbers ($x_i \in \mathbf{R}^d$), and our corresponding outputs
$y_i$ are real numbers ($y_i \in \mathbf{R}$). Using a linear model to model the
relationship is equivalent to making the assumption that there is a vector of
parameters $\theta \in \mathbf{R}^d$ (the slope) and an intercept $b \in \mathbf{R}$
such that
&lt;/p>
$$
\begin{align*}
y_i = \theta^\top x_i + b + \varepsilon_i
\end{align*}
$$&lt;p>
where $\varepsilon_i$ is some randomness that our model doesn&amp;rsquo;t capture.&lt;/p>
&lt;p>The mathematical characterization of the problem of finding the optimal parameters $\theta^\star$
and $b^\star$ is
&lt;/p>
$$
\begin{align}
\theta^\star, b^\star := \underset{\theta, b}{\text{arg min}} ~\| \theta^\top X + b\mathbf{1} - y \|_2^2
\end{align}
$$&lt;p>Before continuing, let&amp;rsquo;s break down those symbols:&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;$\theta^\star, b^\star :=$&amp;rdquo;: We typically use the $\star$ superscript to indicate the best or optimal
choice. We use the $:=$ symbol to indicate that we are not writing out any mathematical equivalence, just a definition.
We&amp;rsquo;re saying, &amp;ldquo;The optimal values of $\theta$ and $b$ are&amp;hellip;&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;$\text{arg min}_{\theta, b}$&amp;rdquo;: If we had written something like $\min_x f(x)$, the &amp;ldquo;value&amp;rdquo; of the expression
would be computed by plugging all possible $x$s into the function $f$ and returning the minimum value of $f(x)$.
For example, if $f(x) = x^2$ and possible values of $x$ were $\{-1, 2, 0.1\}$, we would get
$\min_x f(x) = f(0.1) = 0.01$. By changing the $\min$ to $\text{arg min}$, we instead return the value of $x$
&amp;ndash; rather than the value of $f(x)$ &amp;ndash; that minimizes the value of $f(x)$. Thus, &amp;ldquo;$\underset{\theta, b}{\text{arg min}}$&amp;rdquo;
means that we are looking for the values of $\theta$ and $b$ that minimize the rightmost term&amp;hellip;&lt;/li>
&lt;li>&amp;ldquo;$|| \theta^\top X + b\mathbf{1} - y ||_2^2$&amp;rdquo;: Without getting into detail here, this takes in our parameters $\theta$
and $b$, our data $X$, and our outputs $y$, and outputs a number that measures how well our parameters match inputs
to outputs. (Lower is better.)&lt;/li>
&lt;/ul>
&lt;p>While we&amp;rsquo;re here, as is often the case with these types of formulations, we don&amp;rsquo;t get any information about how to
write a computer program to actually obtain the parameters we seek. We have only addressed the problem
&lt;em>setup&lt;/em>. In fact, many such problem setups do not admit helpful algorithms even if they can be expressed
simply.&lt;/p>
&lt;p>Luckily for us, this particular problem can be solved very efficiently, and we most commonly use one of the following
two methods:&lt;/p>
&lt;ol>
&lt;li>Solve some equations using calculus and linear algebra (because in this case we can).&lt;/li>
&lt;li>Use an algorithm called gradient descent (for cases when we can&amp;rsquo;t). This algorithm
is the workhorse of almost all modern machine learning.&lt;/li>
&lt;/ol>
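&lt;p>As a quick illustration of method 1, NumPy can solve the least-squares problem above directly (the synthetic data below is made up for the example):&lt;/p>

```python
import numpy as np

# generate synthetic linear data: y = theta^T x + b + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # inputs, d = 2
theta_true, b_true = np.array([2.0, -1.0]), 0.5
y = X @ theta_true + b_true + 0.01 * rng.normal(size=100)

# np.linalg.lstsq minimizes ||A @ params - y||_2^2 for us;
# appending a column of ones folds the intercept b into the parameters
A = np.hstack([X, np.ones((100, 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_hat, b_hat = params[:2], params[2]
print(theta_hat, b_hat)  # close to [2., -1.] and 0.5
```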
&lt;p>Neither of those is the topic of this post, but there are lots of nice articles
out there if you are interested in learning more. For the rest of &lt;em>this&lt;/em> post, I want
to describe another less widely-known algorithm for finding those parameters.&lt;/p>
&lt;h2 id="ransac">RANSAC&lt;/h2>
&lt;p>One deficiency of the approaches mentioned above is that without certain auxiliary techniques,
both of those methods are sensitive to outliers, as shown in the simple figure below (&lt;a href="https://cs.nyu.edu/~fergus/teaching/vision/12_descriptors_matching.pdf">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/ransac/ls.png" alt="drawing" width="500"/>
&lt;p>Our intuition tells us that something is off here. The line in the image seems to miss that the true relationship,
obscured by the outliers, is roughly along the diagonal from the bottom left to the top right of the plot.&lt;/p>
&lt;p>One simple and interesting way of finding $\theta$
that is robust to outliers is called &lt;strong>RAN&lt;/strong>dom &lt;strong>SA&lt;/strong>mpling &lt;strong>C&lt;/strong>onsensus (RANSAC).
The way it works is as follows:&lt;/p>
&lt;ol>
&lt;li>Randomly select a subset of $d$ input/output $(x, y)$ pairs. Stack the $x_i$s into a matrix $\tilde X$
($x_i$ is the $i$th column) and the $y_i$s into a vector $\tilde y$.&lt;/li>
&lt;li>Solve $\theta^\top \tilde X = \tilde y$ for $\theta$ (provided certain conditions are met, the equation
has a unique solution).&lt;/li>
&lt;li>Across the entire original dataset $X$, count inliers, or the number of $(x, y)$ pairs such that
$\text{dist}(\theta^\top x, y)$ is small (the modeler chooses a threshold and an application-appropriate
distance function $\text{dist}$).&lt;/li>
&lt;/ol>
&lt;p>In order to increase our chances of discovering a favorable parameter combination, we can repeat this
process until the inlier count (perhaps as a fraction of the overall dataset size) is sufficiently high.
The more times we are willing to repeat steps 1-3, the better a model we will find.&lt;/p>
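&lt;p>Steps 1&amp;ndash;3 translate into code quite directly. Here is a minimal NumPy sketch of RANSAC for a linear model (for brevity it omits the intercept; all names are my own):&lt;/p>

```python
import numpy as np

def ransac_line(X, y, n_iters=200, threshold=0.1, rng=None):
    """Fit y ~ theta^T x by repeated minimal fits, keeping the
    parameters that produce the most inliers."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    best_theta, best_inliers = None, -1
    for _ in range(n_iters):
        idx = rng.choice(n, size=d, replace=False)   # step 1: minimal sample
        try:
            theta = np.linalg.solve(X[idx], y[idx])  # step 2: exact fit
        except np.linalg.LinAlgError:
            continue                                 # skip degenerate samples
        inliers = np.sum(np.abs(X @ theta - y) < threshold)  # step 3
        if inliers > best_inliers:
            best_theta, best_inliers = theta, inliers
    return best_theta

# 1-D example: y = 2x, with a quarter of the points replaced by outliers
rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(40, 1))
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=40)
y[:10] += rng.uniform(5, 20, size=10)                # corrupt 10 points
theta = ransac_line(X, y)
print(theta)  # close to [2.]
```

A plain least-squares fit on the same data would be dragged upward by the corrupted points; the minimal-sample-plus-inlier-count loop simply never rewards parameters computed from them.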
&lt;p>The beauty of this algorithm is that the parameter combination we end up with will have been computed from
the most &amp;ldquo;normal&amp;rdquo; set of examples we encounter; in other words, the algorithm tends to avoid
outliers that might adversely influence the model fit. To visually register how RANSAC is able to ignore outliers,
the image below shows the much more reasonable line it would find with reasonable parameters
(&lt;a href="https://en.wikipedia.org/wiki/Random_sample_consensus#:~:text=Random%20sample%20consensus%20(RANSAC)%20is,the%20values%20of%20the%20estimates.">source&lt;/a>):&lt;/p>
&lt;img src="https://www.jgindi.me/posts/ransac/ransac.png" alt="drawing" width="300"/>
&lt;p>While RANSAC is certainly elegant, there are downsides too. As we might expect, RANSAC tends to perform
worse as the dataset becomes
more and more polluted by outliers, though there are modifications to what we just described that increase the
algorithm&amp;rsquo;s outlier tolerance. The primary disadvantage of RANSAC is that there is no guarantee
that the algorithm will converge, which is a fancy way of saying that since our subset selections are random,
we can&amp;rsquo;t be mathematically sure that we will eventually zero in on the optimal model parameters given our data.&lt;/p>
&lt;h2 id="application-to-old-school-computer-vision">Application to (old-school) computer vision&lt;/h2>
&lt;p>I came across RANSAC in a computer vision course at NYU taught by Robert Fergus. Near the end of the course,
after reveling in the various and wondrous ways that neural networks have upended and redefined how computers process
and, more recently, create, visual information, we had a final unit on techniques in computer vision that
predated the deep learning revolution.&lt;/p>
&lt;p>One problem from that unit that one might be interested in solving is to find some kind of correspondence between
two images. For example, given the two images of the same scene on the left, we might want to produce
a mosaic image like the one on the right (&lt;a href="https://cs.nyu.edu/~fergus/teaching/vision/12_descriptors_matching.pdf">source&lt;/a>):
&lt;img src="https://www.jgindi.me/posts/ransac/mosaic.png" alt="image info">&lt;/p>
&lt;p>To highlight where RANSAC comes into play, let&amp;rsquo;s suppose that we&amp;rsquo;ve waved our magic wand and
(1) identified &amp;ldquo;key points&amp;rdquo; in both of the individual images, and (2) determined
a set of correspondences between the key points in the top and bottom images.&lt;/p>
&lt;p>RANSAC can help figure out the transformation that &amp;ldquo;happened&amp;rdquo; that caused the key points in one image to turn
into their (hopefully correct) corresponding points in the other. It turns out that this transformation
has a small number of parameters, and we can actually estimate them using RANSAC as follows:&lt;/p>
&lt;ol>
&lt;li>Randomly select a subset of matching points from both images. (The number of matching points is determined by the
number of parameters to estimate, in this case, 6.)&lt;/li>
&lt;li>Find the parameters of a transformation $T$ that would turn points in one image into their matching points in the other.&lt;/li>
&lt;li>For each matching pair of points, denoted $(k_1, k_2)$, count the number of inliers, i.e., pairs for which $T(k_1)$
isn&amp;rsquo;t far from $k_2$.&lt;/li>
&lt;/ol>
&lt;p>After finding the transformation that corresponds to the largest number of inliers, we can use this transformation
to carry out the remaining steps required to combine the images.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post, we learned about RANSAC, an algorithm that finds the parameters of a model that can handle the presence
of otherwise corruptive outliers. Algorithms that operate by trying a bunch of (literally) random options are generally
the stuff of novices, so it&amp;rsquo;s cool when that very simplicity turns out to solve an important technical problem.&lt;/p></description></item><item><title>Is addition commutative?</title><link>https://www.jgindi.me/posts/2023-07-14-series-rearrangement/</link><pubDate>Fri, 14 Jul 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-07-14-series-rearrangement/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>This will be a quick one. From the very beginning of our mathematical educations, there is a
fundamental fact of which we are made aware: If $a$ and $b$ are numbers of any
kind (natural numbers, integers, rationals, reals, complex), then &amp;ndash; now hold on to your hats &amp;ndash;
&lt;/p>
$$
\begin{align*}
a + b = b + a.
\end{align*}
$$&lt;p>Earth-shattering &amp;ndash; I know. As some of you are likely already aware, I find infinity fascinating. In this post,
I want to briefly discuss how infinity can mess with some of our most basic assumptions
about the nature of one of our most basic arithmetic operations. Without further ado, let&amp;rsquo;s dive
in.&lt;/p>
&lt;h1 id="infinite-series">Infinite series&lt;/h1>
&lt;p>In our everyday lives (unless you&amp;rsquo;re a mathematician), we only ever consider what it means to carry
out arithmetic operations on finite collections of numbers. A first course in calculus challenges us
to think about the nature of infinite sums, objects like:
&lt;/p>
$$
\begin{align*}
1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \dots
\end{align*}
$$&lt;p>
or
&lt;/p>
$$
\begin{align*}
1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots
\end{align*}
$$&lt;p>But what &amp;ndash; you might ask &amp;ndash; is so interesting about these sums? Don&amp;rsquo;t an infinite number of numbers
added together have to add up to $\infty$?&lt;/p>
&lt;p>(Mathematical aside &amp;ndash; feel free to skip:
Before answering that question, it is worth considering what mathematicians even mean when they ask
about the sum of an infinite number of terms. We could never actually add up infinitely many terms,
so instead, the &amp;ldquo;sum&amp;rdquo; of an infinite series is the limiting value of the sequence of the series'
partial sums. That is, for the second series, we want to know the limiting value of the sequence
$1, 1 + \frac{1}{2}, 1 + \frac{1}{2} + \frac{1}{4}, \dots$.)&lt;/p>
&lt;p>It turns out that some infinite series have finite sums (like the second one) and others do not
(like the first one). There are lots of rules and tests that one can perform to determine
what kind of series one is looking at, but for our purposes, it suffices to know that we call
the ones with finite sums &lt;strong>convergent&lt;/strong> and the ones whose sums are infinite &lt;strong>divergent&lt;/strong>.&lt;/p>
&lt;h1 id="the-harmonic-series-diverges">The harmonic series diverges&lt;/h1>
&lt;p>The first series we looked at earlier is so famous that it has a name: the Harmonic Series. It
is an example of a series whose underlying sequence has terms that get smaller and smaller, but
which, despite that fact, diverges. For full effect, let&amp;rsquo;s briefly have a look at why it diverges.&lt;/p>
&lt;p>Let $H$ denote the sum of the Harmonic Series, and let&amp;rsquo;s compare $H$ to the sum of another,
similar-looking series $H'$:
&lt;/p>
$$
\begin{align*}
H &amp;= 1 + \frac{1}{2} + \biggr(\frac{1}{3} + \frac{1}{4}\biggr) + \biggr(\frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8}\biggr) + \dots \\
H' &amp;= 1 + \frac{1}{2} + \biggr(\frac{1}{4} + \frac{1}{4}\biggr) + \biggr(\frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8}\biggr) + \dots
\end{align*}
$$&lt;p>
Notice that (1) each of the parenthesized groups of terms in $H'$ sums to $1/2$, and (2) each of the corresponding groups of terms
in $H$ sums to a number that is &lt;em>greater&lt;/em> than $1/2$. One of the rules one learns in calculus is (informally) that if you are
comparing two sums and the smaller of them is infinite, the larger one must be infinite too. Since $H > H'$ because of (2), and
$H' = \infty$ because of (1) ($H'$ is a sum of infinitely many $1/2$s), $H$ must be infinite too.&lt;/p>
&lt;h1 id="can-we-make-it-converge">Can we make it converge?&lt;/h1>
&lt;p>Even though $H$ diverges, we can make it converge by changing half of the additions into subtractions like so:
&lt;/p>
$$
\begin{align*}
1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} + \frac{1}{7} - \frac{1}{8} + \dots
\end{align*}
$$&lt;p>
It turns out that this modified series converges! (We will not spend time proving that this series converges, but if you&amp;rsquo;re interested, I wrote another post a while ago covering that; you can check it out &lt;a href="https://www.jgindi.me/posts/alternating/">here&lt;/a>.)
To make things even weirder, if you group the terms with even denominators together, the sum of their absolute values diverges. (The same applies to the group of terms with odd denominators, as you might expect.)&lt;/p>
&lt;p>For visual intuition that this series converges (in lieu of a proof), this image should help
(&lt;a href="https://xaktly.com/AlternatingSeries.html">source&lt;/a>):
&lt;img src="https://www.jgindi.me/posts/series-rearrangement/alternating.png" width="500"/>&lt;/p>
&lt;p>The image shows that it converges to approximately 0.694. I claim that I can make it converge to whatever value
I want&amp;hellip;&lt;/p>
&lt;p>But how?&lt;/p>
&lt;h1 id="other-arrangements">Other arrangements&lt;/h1>
&lt;p>Let&amp;rsquo;s say that I wanted the sum of the terms of the alternating harmonic series to be 2.
Carry out the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Add positive terms together until we exceed 2 (we would start with $1 + 1/3 + \dots + 1/15 \approx 2.02$)&lt;/li>
&lt;li>Add negative terms to the output of step 1 until we fall below 2 (we would then add $-1/2$, so we&amp;rsquo;d end up at around $1.52$)&lt;/li>
&lt;li>Add positive terms to the output of step 2 until we exceed 2 (starting with $1/17$, and so on)&lt;/li>
&lt;li>Add negative terms until we fall below 2&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ol>
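&lt;p>The greedy procedure above is easy to simulate. Here is a minimal Python sketch (the function name and the number of terms are my own illustrative choices):&lt;/p>

```python
from itertools import count

def rearranged_sums(target, n_terms):
    """Greedily rearrange 1 - 1/2 + 1/3 - 1/4 + ... so that the
    partial sums converge to `target`."""
    pos = (1.0 / k for k in count(1, 2))    # 1, 1/3, 1/5, ...
    neg = (-1.0 / k for k in count(2, 2))   # -1/2, -1/4, -1/6, ...
    total, sums = 0.0, []
    for _ in range(n_terms):
        # Below (or at) the target: spend the next positive term;
        # above it: spend the next negative term.
        total += next(pos) if total <= target else next(neg)
        sums.append(total)
    return sums

sums = rearranged_sums(target=2.0, n_terms=200_000)
# The early partial sums oscillate around 2 (first crossing at about 2.02,
# then about 1.52, ...) and the oscillations keep shrinking.
```

&lt;p>Running it reproduces the oscillating partial sums described above, with ever-shrinking overshoots around the target.&lt;/p>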
&lt;p>Why will this sequence of numbers (2.02, 1.52, &amp;hellip;) produced by the algorithm I proposed converge to 2?&lt;/p>
&lt;ol>
&lt;li>As we mentioned earlier, the absolute values of the terms in the positive and negative groups each sum to $\infty$, so we will always
have enough unused terms to get us above or below 2 when we need to.&lt;/li>
&lt;li>The sizes of the successive terms from each group that we use are getting smaller and smaller, so at each step, the amount by which
we exceed and fall below 2 shrinks.&lt;/li>
&lt;/ol>
&lt;p>Taken together, we see that the partial sums will oscillate around 2, and successive oscillations will be smaller and smaller; that&amp;rsquo;s simply a fancy
way of saying that our series now converges to 2. But didn&amp;rsquo;t it just converge to 0.694?&lt;/p>
&lt;p>This is a quirk of trying to sum infinitely many terms. When finitely many terms are involved, rearranging terms doesn&amp;rsquo;t affect
anything. There are even many infinite sums for which the same is true (our second one from earlier, for example).
With infinity, though, it is always important to reexamine facts that are obvious or trivial-seeming in finite territory. We&amp;rsquo;ve just
shown that under special conditions, the infinite version of $a + b$ is not necessarily the same as $b + a$.&lt;/p></description></item><item><title>Faster language model inference</title><link>https://www.jgindi.me/posts/2023-04-06-fast-lm-inf/</link><pubDate>Thu, 06 Apr 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-04-06-fast-lm-inf/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Over the past few years, large language models (LLMs) &amp;ndash; most recently &lt;a href="https://chat.openai.com/auth/login">ChatGPT&lt;/a> &amp;ndash; have
received lots of (well-deserved) press. Though they have their shortcomings, they are able to compose shockingly
cogent prose, and the quality appears to increase the bigger the models themselves become.&lt;/p>
&lt;p>One aspect that I think is often overlooked by much of the public is the (computational, which implies financial) cost
of actually turning the model&amp;rsquo;s crank to produce text. Soon after ChatGPT was released, it was estimated that at 1 million users
it was costing OpenAI around 3 million dollars per month in cloud compute costs. With 100 million users (assuming nothing else
has changed), this would cost 300 million dollars per month, or 3.6 billion dollars per year!&lt;/p>
&lt;p>One of the themes we&amp;rsquo;ve observed over and over in technology over the past few decades is that when a disruptive, but costly,
technological advance emerges, the cost of producing &amp;ndash; or in this case, serving &amp;ndash; that technology typically declines
in response to increased motivation to profitably unleash the technology&amp;rsquo;s capabilities.&lt;/p>
&lt;p>In this post, I want to describe a relatively simple and intuitive technique from &lt;a href="https://arxiv.org/pdf/2302.01318.pdf">a paper by DeepMind&lt;/a>
that might be a first step in the direction of bringing down the cost of operating the types of large language models that I believe will become
ubiquitous in the years and decades to come.&lt;/p>
&lt;p>(As a note, below I will use the word &amp;ldquo;expensive&amp;rdquo; a lot. This word refers to computational expense, but computational expense
is directly relatable to financial expense, so you can think of it that way too if you&amp;rsquo;d like.)&lt;/p>
&lt;h1 id="background">Background&lt;/h1>
&lt;p>At a very high level, many large language models (such as GPT and friends) generally produce text using &lt;strong>autoregressive sampling&lt;/strong>, which is
a fancy term for using previously generated text (hence autoregressive) to produce a probability
distribution over possible next words and then drawing from it (hence sampling). To understand what this means, suppose your vocabulary
has three words: &amp;ldquo;apple,&amp;rdquo; &amp;ldquo;banana,&amp;rdquo; and &amp;ldquo;carrot.&amp;rdquo; A distribution over these three words in essence assigns
a probability to obtaining each word (the probabilities have to add up to 1) if you were to sample randomly
from them. (There are infinitely many possible distributions you could choose over even this tiny vocabulary;
a model&amp;rsquo;s job is to produce a reasonable one, not a &amp;ldquo;correct&amp;rdquo; one.)&lt;/p>
&lt;p>The distribution over your vocabulary words is determined by the model at each step. In some sense, the model produces a
distribution that makes sense given the text you&amp;rsquo;ve already generated (and/or a prompt that you wrote). If you had already
generated the partial sentence &amp;ldquo;I will wear a raincoat because it is going to,&amp;rdquo; a good model would
produce a distribution over your vocabulary that indicates a very high probability on a word like &amp;ldquo;rain,&amp;rdquo;
and a very low probability on a word like &amp;ldquo;spinach&amp;rdquo; (which contextually doesn&amp;rsquo;t fit).&lt;/p>
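&lt;p>For concreteness, drawing a next word from such a distribution takes only a couple of lines of Python (the words and probabilities here are made up for illustration):&lt;/p>

```python
import random

vocab = ["apple", "banana", "carrot"]
probs = [0.90, 0.07, 0.03]   # a hypothetical model's distribution; sums to 1

# Sample the next word in proportion to the assigned probabilities.
next_word = random.choices(vocab, weights=probs, k=1)[0]
```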
&lt;p>The takeaway from this can be summarized as follows:&lt;/p>
&lt;ol>
&lt;li>These models are very large, so each step (word) is expensive to compute.&lt;/li>
&lt;li>Because the samples have to be produced sequentially, they require
many LLM steps.&lt;/li>
&lt;/ol>
&lt;p>Put simply, many steps x high cost = very high cost!&lt;/p>
&lt;h1 id="the-new-idea">The New Idea&lt;/h1>
&lt;p>DeepMind attempts to tackle (2) from the previous section by reducing the number of inference steps
required of the very large (and very expensive) model while maintaining the high quality of the tokens;
it sounds like a free lunch! How do they do this?&lt;/p>
&lt;p>The basic idea is that given some number of previous words, we:&lt;/p>
&lt;ol>
&lt;li>Use a smaller, less expensive model to generate a candidate sequence of a certain length.&lt;/li>
&lt;li>Use the big model to score the words generated by the small model. (The scores here can be
thought of as measures of approval of the draft tokens by the big model.)&lt;/li>
&lt;li>Use the word scores to decide how much of the sequence to use.&lt;/li>
&lt;li>Rinse and repeat until the sequence is of the desired length.&lt;/li>
&lt;/ol>
&lt;p>If you stop reading here, you&amp;rsquo;ve learned the important idea. Lately, I&amp;rsquo;ve been trying to keep posts very
high level, but because the importance-to-complexity ratio of this idea is so high,
I&amp;rsquo;m going to break pattern and go into some more technical detail in the sections below.&lt;/p>
&lt;h1 id="speculative-sampling">Speculative Sampling&lt;/h1>
&lt;p>We will now discuss the algorithm in more detail. The below steps are carried out until the sequence is of the
desired length.&lt;/p>
&lt;h2 id="step-1-generate-and-score-a-draft">Step 1: Generate and Score a Draft&lt;/h2>
&lt;p>The draft is generated sequentially using a smaller, less expensive model (the draft model), and then the scoring &amp;ndash;
which requires appealing to the large model &amp;ndash; happens for all of the draft tokens in parallel.
In this context, scoring a token means computing the probability of that token occurring given the current sequence and the already generated draft tokens.
This parallelism is where the algorithm&amp;rsquo;s speed-up comes from.&lt;/p>
&lt;h2 id="step-2-deciding-how-much-of-the-sequence-to-accept">Step 2: Deciding How Much of the Sequence to Accept&lt;/h2>
&lt;p>Next, the algorithm requires figuring out how much of the sequence to accept.
This takes the form of accepting each successive token produced by the draft model
with some probability (that depends on the prior tokens that have already been accepted).
Once we decide not to accept a token, we sample a token from some distribution (we will think through
a good one to use below) and start fresh with a new draft.
In my opinion, choosing the right probability and the right alternative distribution
is where the algorithm&amp;rsquo;s cleverness is really on display. To be specific about what we aim to
disambiguate in this section, there are two questions we seek to answer:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>What probability $r$ should we use to decide whether or not to accept the next token?&lt;/strong>&lt;/li>
&lt;li>&lt;strong>What alternative distribution should we use if the $i$th token of the draft is not accepted?&lt;/strong>&lt;/li>
&lt;/ol>
&lt;p>Before addressing these questions, we should state this algorithm&amp;rsquo;s overall objective a little bit more
precisely:
&lt;strong>We want the speculatively sampled sequence to come from the same distribution as the sequence we would&lt;/strong>
&lt;strong>get if we autoregressively sampled a sequence from the large expensive model.&lt;/strong>&lt;/p>
&lt;p>Another item to clarify is that we can, in some sense, discuss models and distributions interchangeably.
Here, a model is a way of taking a sequence and producing probabilities that each token in the vocabulary is the
next one in the sequence. We can thus refer to and talk about models like we refer to distributions. To this end,
let $q(x \mid x_1, x_2, \dots, x_i)$ be the expensive model, and $p(x \mid x_1, x_2, \dots, x_i)$ be
the draft model. As in the paper, we will also refer to the $t$th draft token as $\tilde x_t$.&lt;/p>
&lt;p>I will first state the answers to questions (1) and (2), and then I will show that they work. The probability $r$
that we use is given by the expression
&lt;/p>
$$
\begin{align*}
r = \min\biggr(1, \frac{q(\tilde x_t \mid x_1, \dots, x_{n+t-1})}{p(\tilde x_t \mid x_1, \dots, x_{n+t-1})} \biggr)
\end{align*}
$$&lt;p>
and the rejection distribution is
&lt;/p>
$$
\begin{align}
(q(x \mid x_1, \dots, x_{n+t-1}) - p(x \mid x_1, \dots, x_{n+t-1}))_+
= \frac{\max(0, q(x \mid x_1, \dots, x_{n+t-1}) - p(x \mid x_1, \dots, x_{n+t-1}))}
{\sum_{x'} \max(0, q(x' \mid x_1, \dots, x_{n+t-1}) - p(x' \mid x_1, \dots, x_{n+t-1}))}
\end{align}
$$&lt;p>These probabilities and distributions depend on the combined initial sequence and already accepted draft tokens.
Also, note that if $q(\tilde x_t \mid x_1, \dots, x_{n+t-1}) > p(\tilde x_t \mid x_1, \dots, x_{n+t-1})$, then $r = 1$ and we automatically accept
the $t$th draft token. Intuitively, this checks out, because if the draft model produces a token which, given
the prior sequence tokens, is more likely to have been produced by the large model than the draft model, of course
we should use it!&lt;/p>
&lt;p>If instead we have $q(\tilde x_t \mid x_1, \dots, x_{n+t-1}) \leq p(\tilde x_t \mid x_1, \dots, x_{n+t-1})$,
then we accept $\tilde x_t$ with a probability that is larger when $q$&amp;rsquo;s score is close to $p$&amp;rsquo;s.
If $p$ gives $\tilde x_t$ a score of 0.36 and $q$ gives it a score of 0.12, for example, then we will accept
$\tilde x_t$ with probability 0.12 / 0.36 = 1/3. Alternatively, if $p$ gives a score of 0.36
and $q$ gives a score of 0.0001, we will likely not accept the token because the target and draft
models really disagree about whether $\tilde x_t$ would make a good next token.&lt;/p>
&lt;p>(Recall from earlier that we computed the $q$ scores for the draft tokens in parallel during Step 1.)&lt;/p>
&lt;p>If we accept $\tilde x_t$, then we take another step and consider whether or not to accept $\tilde x_{t+1}$.
If, on the other hand, we reject $\tilde x_t$, we sample from the complicated looking distribution
we spelled out in equation (1). What we want is a distribution that re-weights the possible tokens
to sample from the large expensive model in a sensible way. In our case, $q - p$ (in the numerator)
will produce negative &amp;ldquo;probabilities&amp;rdquo; for tokens where $q &lt; p$, and positive values where $q > p$. This kind of makes
sense, because if $\tilde x_t$ is rejected, we want to favor sampling tokens to which $q$ assigns higher scores than $p$ does.
We have two problems, though:&lt;/p>
&lt;ol>
&lt;li>Probabilities have to be nonnegative&lt;/li>
&lt;li>Probabilities have to sum to 1&lt;/li>
&lt;/ol>
&lt;p>But these are actually no problem at all! To solve (1), we modify $q-p$ to $\max(0, q-p)$, and to solve (2),
we use the standard normalization trick: dividing by the sum!
(For example, if I wanted to make the list [1, 2, 3, 4, 5] into a probability
distribution, I would divide each element by the sum to obtain [1/15, 2/15, 3/15, 4/15, 5/15].) Once we
make both of those modifications, we obtain equation (1), which we shorten to the expression on the left
side of the equals sign.&lt;/p>
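&lt;p>To make the clip-and-normalize step concrete, here is a tiny sketch with made-up toy distributions $q$ and $p$ over a four-token vocabulary:&lt;/p>

```python
import numpy as np

q = np.array([0.5, 0.1, 0.2, 0.2])   # target (expensive) model, toy numbers
p = np.array([0.3, 0.3, 0.3, 0.1])   # draft (cheap) model, toy numbers

residual = np.maximum(0.0, q - p)    # fix problem (1): clip negatives to zero
residual /= residual.sum()           # fix problem (2): renormalize to sum to 1
# residual is [2/3, 0, 0, 1/3]: all mass moves to tokens where q > p
```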
&lt;p>So far, we have resorted to intuition to motivate our choices, but it turns out that the two choices
fit together like elegant mathematical puzzle pieces. How this happens goes back to our
objective, which is to come up with a sample that looks as though it was obtained strictly using the
target (expensive model). Let&amp;rsquo;s see if we can show that our strategy helps accomplish this.&lt;/p>
&lt;h3 id="proving-that-we-recover-the-target-distribution">Proving that we recover the target distribution&lt;/h3>
&lt;p>If we have two discrete distributions $a$ (target) and $b$ (draft) and a draft sample
$x' \sim b$ ($\sim$ means &amp;ldquo;sampled from&amp;rdquo;), let $X$ be a random variable representing the sample
produced by speculative sampling. If $X$ ends up taking on a specific value $x$, there are two possible
ways it could have happened:&lt;/p>
&lt;ol>
&lt;li>We accepted $x'$, in which case $x' = x$.&lt;/li>
&lt;li>We rejected $x'$, in which case $x \sim (a - b)_+$.&lt;/li>
&lt;/ol>
&lt;h3 id="outcome-1">Outcome 1&lt;/h3>
&lt;p>The probability of outcome (1) is the probability that the draft sample $x'$ is accepted given that it takes the particular value $x$.
We have to multiply this by the probability that the draft distribution assigns to the event that $x'$ takes on that value, so we have
&lt;/p>
$$
\begin{align*}
P(\text{option 1}) = P(x'~\text{accepted} \mid x' = x)P(x' = x)
\end{align*}
$$&lt;p>The probability of sampling the value of $x$ from the distribution $b$ is simply $b(x)$. The probability of
accepting it is $\min(1, a(x) / b(x))$ (from the algorithm specification). We thus have
&lt;/p>
$$
\begin{align*}
P(\text{option 1}) = b(x) \min(1, a(x)/b(x)) = \min(b(x), a(x)).
\end{align*}
$$&lt;h3 id="outcome-2">Outcome 2&lt;/h3>
&lt;p>On the other hand, if $x'$ is rejected, then the probability that $X$ takes the value
$x$ is the probability of sampling $x$ from $(a(x) - b(x))_+$. By our definition of that
distribution, we would have
&lt;/p>
$$
\begin{align*}
P(X = x \mid x'~\text{rejected}) = \frac{\max(0, a(x) - b(x))}{\sum_{\hat x} \max(0, a(\hat x) - b(\hat x))}.
\end{align*}
$$&lt;p>
We need to weight this outcome by the probability that $x'$ is rejected, which is given by
&lt;/p>
$$
\begin{align*}
P(x'~\text{rejected}) &amp;= 1 - P(x'~\text{accepted}) \\
&amp;= 1 - \sum_{\hat x} P(X = \hat x, x'~\text{accepted}) \\
&amp;= 1 - \sum_{\hat x} \min(a(\hat x), b(\hat x)) \\
&amp;= \sum_{\hat x} a(\hat x) - \min(a(\hat x), b(\hat x)) \\
&amp;= \sum_{\hat x} \max(0, a(\hat x) - b(\hat x)).
\end{align*}
$$&lt;p>
In the above sequence of equalities, the second uses the fact that to get a marginal probability like $P(x'~\text{accepted})$, we can sum
the joint probability over all possible values of the other variable (here, $\hat x$). The third equality reuses our computation from Outcome 1.
The fourth uses the fact that the 1 outside the summation can be broken into probabilities $a(\hat x)$
for all possible values of $\hat x$, since probabilities must sum to 1. Finally, the last equality follows
when you flip $-\min(a(\hat x), b(\hat x))$ to $\max(-a(\hat x), -b(\hat x))$ and then add the $a(\hat x)$
to both of the arguments to $\max$.&lt;/p>
&lt;p>Now that we&amp;rsquo;ve worked all of the details out, does the last expression look familiar? It is the denominator of
$P(X = x \mid x'~\text{rejected})$! Multiplying our two probabilities together, we have
&lt;/p>
$$
\begin{align*}
P(\text{option 2}) &amp;= P(X = x \mid x'~\text{rejected}) P(x'~\text{rejected}) \\
&amp;= \frac{\max(0, a(x) - b(x))}{\sum_{\hat x} \max(0, a(\hat x) - b(\hat x))} \sum_{\hat x} \max(0, a(\hat x) - b(\hat x)) \\
&amp;= \max(0, a(x) - b(x))
\end{align*}
$$&lt;h3 id="putting-them-together">Putting Them Together&lt;/h3>
&lt;p>Now that we&amp;rsquo;ve computed probabilities for both options, we note that the two possibilities are mutually
exclusive and exhaustive ways that $X$ can take the value $x$. Thus, the probability $P(X = x)$ is given by
&lt;/p>
$$
\begin{align*}
P(X = x) &amp;= P(\text{option 1}) + P(\text{option 2}) \\
&amp;= \min(b(x), a(x)) + \max(0, a(x) - b(x)).
\end{align*}
$$&lt;p>
Now, if $a(x) > b(x)$, then the first term is $b(x)$ and the second term is $a(x) - b(x)$. Adding these together,
we get $a(x)$. If $a(x) \leq b(x)$, then the first term is $a(x)$ and the second term is 0, so again the sum
is $a(x)$. Thus, speculative sampling recovers the target distribution $a(x)$. In other words, the rejection
sampling technique we&amp;rsquo;ve devised produces sequences of tokens that are theoretically indistinguishable
from the very expensive target model!&lt;/p>
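&lt;p>The identity we just derived is also easy to check empirically. The following sketch (again with made-up toy distributions) runs a single accept/reject step many times and compares the empirical distribution of $X$ with the target $a$:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.5, 0.1, 0.2, 0.2])   # target distribution (toy numbers)
b = np.array([0.3, 0.3, 0.3, 0.1])   # draft distribution (toy numbers)

residual = np.maximum(0.0, a - b)
residual /= residual.sum()           # the rejection distribution (a - b)_+

def speculative_step():
    x = rng.choice(4, p=b)                      # draft sample x' ~ b
    if rng.random() < min(1.0, a[x] / b[x]):    # accept with prob min(1, a/b)
        return x
    return rng.choice(4, p=residual)            # else resample from (a - b)_+

counts = np.bincount([speculative_step() for _ in range(100_000)], minlength=4)
empirical = counts / counts.sum()
# empirical should be close to a = [0.5, 0.1, 0.2, 0.2]
```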
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>While this technique is just one step towards making LLMs more efficient, it highlights the potential return on
further innovation in the space of faster LLM inference. As the number of LLM applications continues to explode, we can expect even more creative solutions
to emerge, hopefully making these powerful tools more accessible and affordable for everyone.&lt;/p>
&lt;p>If not speculative sampling itself, methods in the same spirit will become more necessary and important as we
continue to push the boundaries of size and scale in generative models. I thought this technique was worth
illuminating because of its simple, yet powerful, theoretically grounded choices. In many deep
learning applications, systems often seem like quasi-magical feats of engineering whose designers don&amp;rsquo;t even always
know why they work as well as they do. In reading DeepMind&amp;rsquo;s speculative sampling paper, I found the technique&amp;rsquo;s
simplicity and mathematical rigor refreshing.&lt;/p></description></item><item><title>How does OpenAI's DALL-E work?</title><link>https://www.jgindi.me/posts/2023-01-03-dalle/</link><pubDate>Tue, 03 Jan 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-01-03-dalle/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Before I say anything else, create an account and take &lt;a href="https://openai.com/dall-e-2/">DALL-E 2&lt;/a> out
for a spin. It is an example of &lt;strong>generative&lt;/strong> AI, which, as a field, has seen important, exciting, and
viral breakthroughs during 2022. Here are some examples from the DALL-E paper:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/dalle/dalle-examples.png" alt="">
&lt;em>Examples of captions and generated images from the DALL-E paper.&lt;/em>&lt;/p>
&lt;p>To understand what generative AI is, let&amp;rsquo;s say you have images that have each been labeled
as either an image that contains a cat, or an image that does not contain a cat. A (probabilistic)
&lt;strong>discriminative&lt;/strong> model is one that tries to extract some kind of meaningful information from images
(such as the presence of certain types of edges or textures) and discriminate between cat and non-cat images. Most
classification models implemented across industry are of this type. A generative model, on the other
hand, learns how to generate new (image, cat/no cat label) pairs. Ian Goodfellow, a pioneer of generative modeling,
&lt;a href="https://www.quora.com/Why-are-generative-models-harder-to-create-than-discriminative-models">very succinctly&lt;/a>
explains why generative modeling is usually much harder to do well:&lt;/p>
&lt;blockquote>
&lt;p>Can you look at a painting and recognize it as being the Mona Lisa? You probably can. That&amp;rsquo;s discriminative modeling.
Can you paint the Mona Lisa yourself? You probably can&amp;rsquo;t. That&amp;rsquo;s generative modeling.&lt;/p>&lt;/blockquote>
&lt;p>The most popular examples of generative AI that have emerged over the past year have been in the fields of computer vision
(&lt;a href="https://stablediffusionweb.com">Stable Diffusion&lt;/a>, DALL-E) and natural language processing (&lt;a href="https://chat.openai.com">Chat GPT&lt;/a>),
and in this post, I want to try to provide a very high-level overview of how DALL-E learns to generate images from
text prompts.&lt;/p>
&lt;h2 id="what-we-will-not-discuss-but-should-at-least-mention">What we will not discuss (but should at least mention)&lt;/h2>
&lt;p>The &amp;ldquo;machine learning&amp;rdquo; that powers many of these large, impressive models is very often not the hardest part of
developing them. There are computational considerations and optimizations that are critically
important to making such models actually work, but I don&amp;rsquo;t think describing them here will add a lot of value for my readers,
so we won&amp;rsquo;t be going into those details. (In fact, in the DALL-E paper, they explicitly acknowledge that &amp;ldquo;[g]etting the model to train
in 16-bit precision past one billion parameters, without diverging, was the most challenging part of this project.&amp;rdquo;)&lt;/p>
&lt;p>Another important thing to remember is that the amazing capabilities of these models emerge from mountains of high quality data
and computing power. Even if the model code is made available to the public, training and using the models without sufficient
ability to handle huge quantities of data and throw massive amounts of compute at the problem will result in useless models at best.
There are some efforts to create smaller versions of these models that can be run and used by individuals, but so far they
don&amp;rsquo;t seem as capable, in general, as their huge progenitors.&lt;/p>
&lt;h2 id="disclaimer">Disclaimer&lt;/h2>
&lt;p>This post is my high-level overview/summary of part of the &lt;a href="https://arxiv.org/pdf/2102.12092.pdf">DALL-E paper&lt;/a>. It is possible
that I misunderstood something or explained it incorrectly. If you come across any such mistakes, let me know so that I
can learn from and correct them.&lt;/p>
&lt;p>With that out of the way, let&amp;rsquo;s have a look under the hood.&lt;/p>
&lt;h1 id="high-level-overview">High level overview&lt;/h1>
&lt;p>DALL-E learns using about 250 million (image, text) pairs. The learning process is roughly:&lt;/p>
&lt;ol>
&lt;li>Learn how to come up with useful (and compressed) representations of images&lt;/li>
&lt;li>Turn each prompt into a sequence of tokens&lt;/li>
&lt;li>Combine the representations of the images and the corresponding prompts&lt;/li>
&lt;li>Train a neural network to predict the next token of the combined representation given
some of the previous tokens.&lt;/li>
&lt;/ol>
&lt;p>We will discuss each of these steps in turn.&lt;/p>
&lt;h1 id="learning-image-representations">Learning image representations&lt;/h1>
&lt;p>One of the most important, foundational ideas in deep learning is:
neural networks like numbers (not images or text). In order to bring the full power of
neural networks to bear on text and vision problems, we usually use approaches that
are variations on this theme:&lt;/p>
&lt;ol>
&lt;li>Convert the image (or text) into a numerical representation such that the meaning/content of the image
(or text) is preserved. (For example, in representation space, collections of numbers corresponding to images
of cats should be closer to one another than they are to collections of numbers corresponding to
images of galaxies.) We will refer to these as representations or embeddings.&lt;/li>
&lt;li>Use the representations for some task that we care about (e.g. discriminate between cat and non-cat images).&lt;/li>
&lt;/ol>
&lt;p>There are many approaches that accomplish step 1. In the case of DALL-E, OpenAI used an
&lt;strong>autoencoder&lt;/strong>. This architecture consists of two parts:&lt;/p>
&lt;ol>
&lt;li>Encoder: takes an image and produces a representation&lt;/li>
&lt;li>Decoder: takes the representation and tries to reproduce the image&lt;/li>
&lt;/ol>
&lt;p>In order to learn, the encoder and decoder use penalties incurred for differences between the original image and the
decoder&amp;rsquo;s reconstruction to update their internal states. When training is complete, we can use
the encoder to produce useful image representations for whatever other tasks we intend to carry out with the
images as input. Intuitively, we can think about an autoencoder as a kind of compression algorithm. We take an image, compress it
into a representation that is (1) smaller than the original image, but (2) such that we can (mostly) reconstruct
the original image. If we can do this well, then we&amp;rsquo;ve come up with a representation that seems to carry much of the
important information from the original image, which is what we wanted.&lt;/p>
&lt;p>(Really, it uses the generative and more involved cousin of autoencoders called a variational autoencoder, but the high level idea of
learning a compressed representation is the same.)&lt;/p>
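&lt;p>To make the encode/decode/penalize loop concrete, here is a deliberately tiny linear autoencoder sketch in NumPy. (It is nothing like DALL-E&amp;rsquo;s discrete VAE in scale or architecture; the point is only that the reconstruction error drives the updates to both halves.)&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 64, 8                     # compress 64 numbers down to 8
W_enc = rng.normal(scale=0.1, size=(d_in, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, d_in))
X = rng.normal(size=(200, d_in))         # stand-in for flattened images

def loss(X_hat, X):
    return float(np.mean((X_hat - X) ** 2))

lr = 0.05
initial = loss(X @ W_enc @ W_dec, X)
for _ in range(1000):
    Z = X @ W_enc                        # encoder: image -> representation
    X_hat = Z @ W_dec                    # decoder: representation -> image
    G = 2.0 * (X_hat - X) / len(X)       # gradient of the reconstruction error
    gW_dec = Z.T @ G
    gW_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * gW_dec                 # penalize both halves for the
    W_enc -= lr * gW_enc                 # difference between X_hat and X
final = loss(X @ W_enc @ W_dec, X)
# `final` should be noticeably smaller than `initial`.
```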
&lt;p>The autoencoder that DALL-E uses compresses 256x256 (dimensions in pixels) images into 32x32 representations. Each 256x256
image has 3 numbers associated with each pixel (red, green, and blue concentrations). Thus, each original image
requires 256 * 256 * 3 = 196,608 numbers to represent it, whereas the representation only requires 32 * 32 = 1024 numbers.
This is a compression factor of 192!&lt;/p>
&lt;p>For a visual idea of how good these encodings are once the system is trained, here are some examples
of original and reconstructed images:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/dalle/vae-reconstr.png" alt="">
&lt;em>Original (top) and reconstructed (bottom) images produced by the image representation learning system.&lt;/em>&lt;/p>
&lt;p>(Note: Representations of this kind are often continuous, meaning the numbers in each representation slot can be
any real number. In this case, the encoding is discrete, which just means that the numbers in each representation slot
are actually whole numbers instead.)&lt;/p>
&lt;h1 id="encoding-prompts">Encoding prompts&lt;/h1>
&lt;p>To encode the prompts, DALL-E uses byte-pair encoding, which can more broadly be categorized as a &lt;strong>tokenization&lt;/strong> method.
Tokenization is a way of breaking down unstructured natural language into a finite set of meaningful atomic units. It can be carried
out at the level of sentences, words, or even parts of words (e.g., &amp;ldquo;eating&amp;rdquo; might be broken into &amp;ldquo;eat&amp;rdquo; and &amp;ldquo;ing&amp;rdquo;). Each token in
a limited vocabulary (e.g., 10k frequent words or subwords) is usually assigned an identifier &amp;ndash; in this case, a number
(there are usually also numbers reserved for unknown tokens, beginnings and ends of sentences, etc.).
Once we&amp;rsquo;ve decided how we&amp;rsquo;re going to tokenize and have a vocabulary of identifiers, we can then process our
input text and turn it into its corresponding tokenized representation.&lt;/p>
&lt;p>(We can manually decide the level at which to tokenize text (e.g., words, sentences) or we can use machine learning to learn
a good tokenization, but we aren&amp;rsquo;t going to go into detail about those approaches here.)&lt;/p>
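&lt;p>A toy word-level illustration (with a made-up vocabulary &amp;ndash; DALL-E&amp;rsquo;s actual byte-pair vocabulary is far larger and operates on subword units):&lt;/p>

```python
# Made-up toy vocabulary; ids 0-2 are reserved for unknown/start/end markers.
vocab = {"UNK": 0, "BOS": 1, "EOS": 2,
         "a": 3, "cat": 4, "wearing": 5, "hat": 6}

def tokenize(text):
    words = text.lower().split()
    ids = [vocab.get(w, vocab["UNK"]) for w in words]
    return [vocab["BOS"]] + ids + [vocab["EOS"]]

# "top" is not in the vocabulary, so it maps to UNK (0).
print(tokenize("a cat wearing a top hat"))  # [1, 3, 4, 5, 3, 0, 6, 2]
```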
&lt;h1 id="jointly-modeling-text-and-image-tokens">Jointly modeling text and image tokens&lt;/h1>
&lt;p>Once we have computed our image representations and tokenized our text, we glue the representations together into what is
essentially a composite representation for a (text, image) pair. We then train a model called a transformer whose inputs are the entire
stream of concatenated text and image tokens. The model learns to predict the next token in the sequence using earlier tokens.&lt;/p>
&lt;p>Transformers are based on a mechanism called &lt;strong>(self-)attention&lt;/strong>. In our case, this means that
the model learns to give different weight to different previously generated tokens as it attempts to predict
the next one. For example, if we were trying to predict the next word (&amp;ldquo;cold&amp;rdquo;) in the sentence &amp;ldquo;I need a jacket because it is ___&amp;rdquo;,
the model would learn that it should give the word &amp;ldquo;jacket&amp;rdquo; more weight than the word &amp;ldquo;I.&amp;rdquo;&lt;/p>
&lt;p>(DALL-E actually uses multi-head attention, which means it actually learns many weighting schemes at the same time for a given
set of previously generated tokens and then makes a prediction using a combination of the outputs of all of those schemes.)&lt;/p>
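&lt;p>For the mechanically inclined, a single attention head can be sketched in a few lines of NumPy (the dimensions and random weights are arbitrary stand-ins for illustration):&lt;/p>

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """One attention head: each token mixes the values of itself and earlier
    tokens, weighted by softmax-normalized query/key similarity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: a token may only attend to itself and earlier tokens.
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
X = rng.normal(size=(n_tokens, d))                   # 5 token embeddings
out = causal_self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
# `out` holds one mixed representation per input token: shape (5, 8)
```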
&lt;h1 id="how-to-generate-images">How to generate images&lt;/h1>
&lt;p>Once we&amp;rsquo;ve trained the transformer from the previous section, given a new prompt that we haven&amp;rsquo;t seen, we can:&lt;/p>
&lt;ol>
&lt;li>Encode it using the encoder we trained in the first step&lt;/li>
&lt;li>Generate (hopefully reasonable) image tokens one-by-one using our transformer. The first image
token would be generated using all of the text tokens, the second image token would use all
text tokens and the first image token, and so on. (Note: This is not exactly how it works, but it&amp;rsquo;s
close enough for our purposes.)&lt;/li>
&lt;li>Use the decoder that we trained in the first step to translate those image tokens back into an image&lt;/li>
&lt;/ol>
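&lt;p>The generation loop in step 2 can be sketched as follows. The &lt;code>model&lt;/code> callable is a stand-in assumption for the trained transformer, and the greedy choice of the most likely token is a simplification (in practice the next token is sampled from the distribution).&lt;/p>

```python
def generate_image_tokens(model, text_tokens, n_image_tokens):
    # `model` maps a token sequence to a probability for each vocabulary entry
    tokens = list(text_tokens)
    image_tokens = []
    for _ in range(n_image_tokens):
        probs = model(tokens)
        # greedy decoding: take the single most likely next token
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)       # condition future steps on this token too
        image_tokens.append(next_token)
    return image_tokens
```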
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>I glossed over many of the mathematical and computational details of how this works (I don&amp;rsquo;t even have my head around all of them!),
but the goal was to demystify one approach used to build a(n awesome) generative AI system that has taken the internet by storm.
Hopefully you enjoyed!&lt;/p></description></item><item><title>Distinct values in a data stream</title><link>https://www.jgindi.me/posts/2022-09-11-data-stream/</link><pubDate>Sun, 11 Sep 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-09-11-data-stream/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I detail a randomized algorithm (that looks rather like black magic) to count the number of distinct elements in a data stream.&lt;/p>
&lt;h2 id="naive-solution">Naive solution&lt;/h2>
&lt;p>Suppose that data is presented as a sequence of values $\sigma = s_1, \dots, s_m$ where, for simplicity, the $s_i \in \lbrace 0, \dots, n - 1\rbrace$. I want to know the number of distinct values that were in the stream. For example, if the sequence were $\sigma = 1,2,3,4,5,5,7$, our algorithm should output 6. How might we accomplish this?&lt;/p>
&lt;p>A very simple way is to keep an $n$-bit vector $v$ ($n$ because that is the size of the set our values are being drawn from) where $v_i$ represents whether we have seen the element $i$. Once we have seen all of the data, we sum the values in $v$ and output that as our result. Easy, right?&lt;/p>
&lt;h2 id="does-it-work">Does it work?&lt;/h2>
&lt;p>The issue here is that for a sufficiently large universe of possible values, using $n$ bits of storage is not feasible. The above approach is (provably) optimal in terms of space&amp;hellip; but that &amp;ldquo;optimality&amp;rdquo; is only with respect to deterministic algorithms that produce the correct result every time. What if we used randomization? Might we be able to achieve sublinear space usage?&lt;/p>
&lt;h2 id="a-better-solution">A better solution&lt;/h2>
&lt;p>The rest of this post will detail a randomized algorithm that uses sublinear space to solve the above problem, with a solution that is correct as close to 100% of the time as we&amp;rsquo;d like&amp;hellip; it&amp;rsquo;s rather like magic. Let&amp;rsquo;s see how it works.&lt;/p>
&lt;p>The algorithm is as follows:&lt;/p>
&lt;ul>
&lt;li>Choose $\varepsilon \in (0,1)$.&lt;/li>
&lt;li>Let $t = \frac{400}{\varepsilon^2}$.&lt;/li>
&lt;li>Pick a pairwise independent hash function $h: \{0,\dots,n-1\} \to \{0,\dots,n-1\}$&lt;/li>
&lt;li>Upon receiving $s_i$, compute $h(s_i)$ and update $D$, a data structure with which we keep track of the $t$ smallest distinct hash values we&amp;rsquo;ve computed.&lt;/li>
&lt;li>When we stop receiving values, let $X$ be the $t^{th}$ smallest hash value and output $\frac{nt}{X}$.&lt;/li>
&lt;/ul>
&lt;p>Before we move on, note that the algorithm only requires space for (1) the hash function (discussed below) and (2) the data structure $D$, which uses a constant amount of space that depends only on $\varepsilon$&amp;hellip; Assuming that our hash function doesn&amp;rsquo;t take up too much space, the algorithm satisfies our space requirement.&lt;/p>
&lt;p>I imagine you&amp;rsquo;re thinking what I&amp;rsquo;m thinking (or what I was thinking)&amp;hellip; namely, that there is absolutely no reason why that should work. Before we dive into some completely mind-blowing mathematical analysis, I want to quickly digress to explain what a pairwise independent hash function is and then to fill in some details and provide an &amp;ldquo;intuitive&amp;rdquo; flavor for where this algorithm comes from.&lt;/p>
&lt;h3 id="pairwise-independent-hash-functions">Pairwise independent hash functions&lt;/h3>
&lt;p>Imagine we have a hash function $h$, two arbitrary distinct inputs $x_1$ and $x_2$, and their corresponding outputs $y_1 = h(x_1)$ and $y_2 = h(x_2)$. The hash function $h$ is pairwise independent if knowing the value of $h(x_1)$ gives us no information about the value of $h(x_2)$.&lt;/p>
&lt;p>In mathematical terms, $h$ is pairwise independent if, for every pair of distinct inputs $x_1 \neq x_2$ and every pair of outputs $y_1, y_2$,
$\Pr[h(x_1)=y_1 \wedge h(x_2) = y_2] = \frac{1}{n^2}$
($n$ is the size of the output space in our case). The natural question we ask when we present a definition is: Do such objects exist? Without going into too much detail about why, be assured they do indeed exist, and we are going to pick ours from the family
$\mathcal{H} = \lbrace h_{ab}: \lbrace 0,\dots,p-1 \rbrace \to \lbrace 0,\dots, p-1\rbrace \rbrace $
where $p$ is prime, $0 \leq a \leq p -1$ and $0 \leq b \leq p-1$, defined, for some input $k$, by $h_{ab}(k) = ak + b \mod p$. In our case, if $n$ is not prime, we can find a prime near $n$ and let $p$ be that prime (an interesting proof for another time is that
for any $n$, there is always a prime between $n$ and $2n$). Note that the only thing we have to store about this hash function to use it are $a$ and $b$ &amp;ndash; each of which only requires $\log p$ bits of storage (so we are still under the linear space we are trying to avoid).&lt;/p>
&lt;h2 id="how-do-we-get-">How do we get $\frac{nt}{X}$?&lt;/h2>
&lt;p>Next, I&amp;rsquo;ll try to motivate where $\frac{nt}{X}$ comes from. Suppose that there are $k$ distinct values in the stream (that is, suppose that $k$ is the solution to our problem). Let those values be $a_1,\dots,a_k$. Because our hash function spreads the $k$ distinct values roughly uniformly over $n$ possible outputs, we can expect the gap between consecutive values of $h(a_i)$ to be about $\frac{n}{k}$. In particular, we expect the $t$th smallest value to be at about $t \cdot \frac{n}{k}$. Thus, $X \approx \frac{nt}{k}$. Solving for $k$, we see that $k \approx \frac{nt}{X}$, so that&amp;rsquo;s exactly what we output. In what follows, we will need to distinguish between the right answer and our output, so the correct answer will henceforth be referred to as $k$ and we will refer to our output, $nt/X$, as $\hat k$.&lt;/p>
&lt;h2 id="so-does-the-fancy-algorithm-work">So does the fancy algorithm work?&lt;/h2>
&lt;p>All that&amp;rsquo;s left to do is now to show that $\hat k$ is very close to $k$. Mathematically speaking, we want to show that
&lt;/p>
$$\frac{k}{1 + \varepsilon} \leq \hat k \leq (1+\varepsilon)k$$&lt;p>
with a probability $\geq \frac{99}{100}$ (where $\varepsilon$ is the parameter we chose in the first step of the algorithm).
If we can show that $\Pr[\hat k > (1+\varepsilon)k] \leq \frac{1}{200}$ and that $\Pr[\hat k &lt; k/(1 +\varepsilon)] \leq \frac{1}{200}$, we get
&lt;/p>
$$\Pr[k/(1+\varepsilon) \leq \hat k \leq (1+\varepsilon)k] = 1 -\Pr[\hat k > (1+\varepsilon)k] -\Pr[\hat k &lt; k/(1 +\varepsilon)].$$&lt;p>Because each of the probabilities on the right side of the inequality are $\leq \frac{1}{200}$, we can rewrite the above as
$\Pr[k/(1+\varepsilon) \leq \hat k \leq (1+\varepsilon)k] \geq 1 - \frac{2}{200} = \frac{99}{100}$
which is what we want.&lt;/p>
&lt;p>So all we have left to do is to show that the two probabilities are indeed both $\leq \frac{1}{200}$.
We will only analyze one of the two probabilities because a symmetric argument takes care of the other side. Putting all of this together, we only have one claim left to prove: $\Pr[\hat k > (1+\varepsilon)k] \leq \frac{1}{200}$.&lt;/p>
&lt;p>First, we note that
&lt;/p>
$$
\begin{align*}
\Pr[\hat k > (1+\varepsilon)k] &amp;= \Pr[ (nt)/X > (1+\varepsilon)k] \\
&amp;= \Pr\biggr[X &lt; \frac{nt}{(1+\varepsilon)k}\biggr].
\end{align*}
$$&lt;p>With this in mind, define a random variable $Y_i$ which takes the value 1 if $h(a_i) &lt; \frac{nt}{(1+\varepsilon)k}$ and 0 otherwise. Now, observe that the probability of $h(a_i)$ taking a value less than $\frac{nt}{(1+\varepsilon)k}$ is the number of hash values between 0 and $\frac{nt}{(1 + \varepsilon)k}$ divided by the number of possible values $h(a_i)$ can take. We can write this mathematically as
&lt;/p>
$$
E[Y_i] = \frac{tn}{(1+\varepsilon)kn} = \frac{t}{(1+\varepsilon)k}.
$$&lt;p>Next, let the random variable $Y$ be the sum of the $Y_i$. Because expectation is linear, we can infer that $E[Y] = \sum_{i = 1}^k E[Y_i] = k \cdot\frac{t}{(1+\varepsilon)k} = \frac{t}{1 + \varepsilon}$. Because $h$ is pairwise independent, the variances of the $Y_i$ add as well, giving
$\text{Var}(Y) = \frac{t}{1+\varepsilon} - \frac{t^2}{(1+\varepsilon)^2 k} \leq \frac{t}{1 + \varepsilon} = E[Y]$.
We&amp;rsquo;re almost there!&lt;/p>
&lt;p>We can now more readily examine the probability we were interested in above in terms of $Y$. That is, we can say
&lt;/p>
$$\Pr \biggr[X &lt; \frac{nt}{(1+\varepsilon)k} \biggr] = \Pr[Y \geq t].$$&lt;p>
Why? The left hand probability represents the chances that the $t$th smallest hash value we saw was less than some value, let&amp;rsquo;s call said nasty value $M$ for a minute. $Y$ is the number of hash values we saw that were less than $M$. If at least $t$ values hashed to values less than $M$, then $X$, the $t$th smallest hash value will be less than $M$, hence the equality.&lt;/p>
&lt;p>Because $E[Y] = \frac{t}{1+\varepsilon}$, we have $t = (1 + \varepsilon) E[Y]$, so what we really want to know is $\Pr[Y \geq (1 + \varepsilon) E[Y]]$. This is bounded above by $\Pr[|Y - E[Y]| \geq \varepsilon E[Y]]$ (adding the absolute value enlarges the event, so we get an inequality rather than an equality). Chebyshev&amp;rsquo;s inequality tells us that
&lt;/p>
$$\Pr[|Y - E[Y]| \geq \varepsilon E[Y]] \leq \frac{\text{Var}(Y)}{\varepsilon^2 E[Y]^2}.$$&lt;p>
Because $\text{Var}(Y) \leq E[Y]$, we can write
&lt;/p>
$$\frac{\text{Var}(Y)}{\varepsilon^2 E[Y]^2} \leq \frac{E[Y]}{\varepsilon^2 E[Y]^2} = \frac{1}{\varepsilon^2 E[Y]}.$$&lt;p>
Now recall that earlier, we said that $t = (1 + \varepsilon) E[Y] \iff E[Y] = \frac{t}{1 + \varepsilon}$. We can substitute this in for $E[Y]$ above and get
&lt;/p>
$$\frac{1}{\varepsilon^2 E[Y]} = \frac{1 + \varepsilon}{\varepsilon^2 t}.$$&lt;p>
Because $\varepsilon$ is at most 1, we conclude
&lt;/p>
$$\frac{1 + \varepsilon}{\varepsilon^2 t} \leq \frac{2}{\varepsilon^2 \frac{400}{\varepsilon^2}} = \frac{2}{400} = \frac{1}{200}.$$&lt;p>Thus, all in all, we&amp;rsquo;ve shown that the odds of our return value being an over-estimate is bounded above by 1/200. A similar argument shows that the probability of underestimating is also bounded above by 1/200, so the probability of erring is at most 1/200 + 1/200 = 1/100 which means our probability of success is at least 99/100, as desired.&lt;/p></description></item><item><title>The St. Petersburg paradox</title><link>https://www.jgindi.me/posts/2022-05-18-stpete/</link><pubDate>Wed, 18 May 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-05-18-stpete/</guid><description>&lt;p>In this post, I want to talk through a simple mathematical result that forces us
to think twice about relying too heavily on averages.&lt;/p>
&lt;h2 id="quick-review-of-expected-values">Quick review of expected values&lt;/h2>
&lt;p>Given an opportunity to play a game with uncertain outcomes, one reasonable way
to value the opportunity is to weight each possible reward by its probability of occurring
and sum up the results. This quantity is called the expected value, or expectation,
of the game. (Note that a simple average is an expected value with equal weight on
each outcome.)
Another way of saying this is that if you have
to pay a fee to play this game, the fee you should be willing to pay is the game&amp;rsquo;s
expected value (or less, of course).
Mathematically, if the payoff of the uncertain game is represented by the random
variable $X$, the possible outcomes are $x_1,x_2,\dots,x_n$,
and the corresponding probabilities of the outcomes are $p_1,p_2,\dots, p_n$,
then the value of playing the game would be given by
&lt;/p>
$$E[X] = x_1p_1 + \dots + x_np_n = \sum_{i=1}^n x_i p_i.$$&lt;p>
(The notation $E[X]$ is how we represent the expected value of $X$.)
For many bets, this approach has intuitive appeal, and there are many scenarios
in which it is used directly.&lt;/p>
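&lt;p>As a quick sanity check of the formula, here is a tiny Python helper; the fair-die example is my own illustration, not tied to any particular application:&lt;/p>

```python
def expected_value(outcomes, probs):
    # weight each payoff by its probability and sum the results
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(x * p for x, p in zip(outcomes, probs))

# a fair six-sided die that pays its face value
print(expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6))  # ≈ 3.5
```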
&lt;h2 id="flipping-coins">Flipping coins&lt;/h2>
&lt;p>With this in mind, consider the following game. You flip a coin until seeing a head. If the first head comes up
on the $k$th flip, you win $2^k$ dollars. The question is: how much would you be
willing to pay to play this game?&lt;/p>
&lt;p>Well, you might begin, there are infinitely many possible outcomes (head after 1, head after 2,&amp;hellip;).
Each outcome requires flipping $k-1$ tails and then a single head. Using a fair
coin, we have
&lt;/p>
$$
P(\text{game ends on flip}~k) = \frac{1}{2^{k-1}}\frac{1}{2} = \frac{1}{2^k}.
$$&lt;p>
Using the framework of outcome values and probabilities of those outcomes,
we can express $x_i$ and $p_i$ for each $i=1,2,\dots$ as
&lt;/p>
$$
\begin{align*}
x_i &amp;= 2^i\\
p_i &amp;= \frac{1}{2^i}.
\end{align*}
$$&lt;p>
But wait! If this is the case, then for each $i$, $x_ip_i = 2^i/2^i = 1$, so the
expected value of the game is actually infinite (the sum of infinitely many 1s)!
It seems, according to this mathematically sound analysis, that we should be
willing to participate in this game at &lt;em>any&lt;/em> offered price. With this information,
how much would you pay to play this game?&lt;/p>
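&lt;p>Before answering, it is instructive to simulate the game. In the sketch below (the trial count and seed are arbitrary choices of mine), every payoff is a power of two, yet the sample average gets yanked around by rare huge wins and never settles near any &amp;ldquo;fair price&amp;rdquo;:&lt;/p>

```python
import random

def play_once(rng):
    # flip until the first head; the payout doubles with every tail
    k = 1
    while rng.random() < 0.5:  # tails, keep flipping
        k += 1
    return 2 ** k

rng = random.Random(0)
payoffs = [play_once(rng) for _ in range(100_000)]
print(sum(payoffs) / len(payoffs))  # unstable: rerun with other seeds and it jumps around
```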
&lt;h2 id="what-gives">What gives?&lt;/h2>
&lt;p>If you think something is fishy here, you&amp;rsquo;re right. This problem is so well-known
it has a name: the St. Petersburg paradox. It isn&amp;rsquo;t actually a paradox, but the use of the word refers
to the fact that on the one hand, this game has infinite expected value, but on the other,
the probability of making a large sum of money is vanishingly small. To put a finer
point on it, the probability of winning more than $2^k$ dollars is $1/2^k$! For
even moderate values of $k$, this probability is minuscule.&lt;/p>
&lt;p>This sort of issue has caused many to reject the exclusive use
of expected value as a valuation technique. Some suggest adding a measure
of risk (as is customary in any sort of financial application),
some advocate for use of the median instead, and still others advocate
using the expected value of some utility function applied to the outcomes, rather
than the outcomes themselves. These are all interesting areas to explore, but the
bottom line is that expectations alone can lead to some head-scratching issues.&lt;/p>
&lt;h2 id="finite-resources">Finite resources&lt;/h2>
&lt;p>One other aspect of the problem to which some attribute the counterintuitive result
is the implicit assumption we made that the banker or casino has infinite wealth
with which to bankroll the game. Let&amp;rsquo;s see what happens if the banker only has finite
wealth. That is, let&amp;rsquo;s now suppose that the banker has
$W$ (for wealth) dollars with which to fund the game. This introduces a ceiling on the number of
rounds before the game ends: $L = \lfloor \log_2 W \rfloor$ (i.e., the number of times you
can double the payout before exceeding $W$, rounded down). Having reframed the problem this way,
the expected value calculation returns a saner result:
&lt;/p>
$$
E[X] = \sum_{i=1}^L 2^i \frac{1}{2^i} = \sum_{i=1}^L 1 = L.
$$&lt;p>
That is, the expected value of the game is now logarithmic in the banker&amp;rsquo;s wealth.
This means, for example, that if the banker has one billion dollars, the game is only
worth about \$30. This accords with our intuition about small probabilities of winning
anything significant.&lt;/p>
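&lt;p>The capped valuation above is easy to check in code (a one-liner, really):&lt;/p>

```python
import math

def capped_game_value(wealth):
    # L = floor(log2(W)) rounds of doubling, each contributing 2^i * (1/2^i) = 1
    return math.floor(math.log2(wealth))

print(capped_game_value(10 ** 9))  # 29, i.e. roughly $30 for a billion-dollar bankroll
```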
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Expected value is one of the most important tools in all of probability, if not the most important!
Even so, as we&amp;rsquo;ve shown in this post, it is not a panacea. If you&amp;rsquo;re not careful, strange
(and fascinating) things might happen.&lt;/p></description></item><item><title>Solving Wordle</title><link>https://www.jgindi.me/posts/2022-01-12-solving-wordle/</link><pubDate>Wed, 12 Jan 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-01-12-solving-wordle/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>For those who don&amp;rsquo;t know what Wordle is, &lt;a href="https://www.powerlanguage.co.uk/wordle/">check it out&lt;/a>.
It&amp;rsquo;s essentially a word game that works like the game MasterMind. If you&amp;rsquo;ve been on the internet
in the past couple of weeks, you&amp;rsquo;ve probably seen your friends or follows posting
little images that show how quickly, and by which path, they solved the day&amp;rsquo;s puzzle.
After trying it, I thought it might be fun to try to write some code that solves the puzzle
(most of the time). The rest of this post will walk through how I came up with the solution,
how I put the code together, and some insights I gleaned using my solver.&lt;/p>
&lt;h2 id="so-what-are-the-rules">So what are the rules?&lt;/h2>
&lt;p>Before going any further, I want to review the rules. The game proceeds
as follows. A target word is chosen and hidden from the player. On each turn,
the player guesses a five-letter word. After each guess, the player receives feedback
about the letter at each position of their guess. For each position, the player might
receive:&lt;/p>
&lt;ol>
&lt;li>&lt;span style="color:green">&lt;strong>Green:&lt;/strong>&lt;/span> if the letter of the guess matches the letter of the target at that position.&lt;/li>
&lt;li>&lt;span style="color:orange">&lt;strong>Yellow:&lt;/strong>&lt;/span> if the letter of the guess matches the letter at some other position of the target.&lt;/li>
&lt;li>&lt;span style="color:grey">&lt;strong>Grey:&lt;/strong>&lt;/span> if the letter of the guess is not in the target.&lt;/li>
&lt;/ol>
&lt;p>For example, if the target word is &amp;ldquo;taker&amp;rdquo; and the guess is &amp;ldquo;talks&amp;rdquo;, the feedback
would be
&lt;span style="color:green">ta&lt;/span>&lt;span style="color:grey">l&lt;/span>&lt;span style="color:orange">k&lt;/span>&lt;span style="color:grey">s&lt;/span>
, because the first two letters are exactly right, &amp;ldquo;l&amp;rdquo; and &amp;ldquo;s&amp;rdquo; are
not in the target word at all, and &amp;ldquo;k&amp;rdquo; is in the target but in a different position than
it occupies in the guess.&lt;/p>
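&lt;p>These rules can be encoded directly. Below is a sketch of a feedback function; the &lt;code>'g'&lt;/code>/&lt;code>'y'&lt;/code>/&lt;code>'-'&lt;/code> encoding is my own, and the two passes handle repeated letters by consuming each target letter at most once:&lt;/p>

```python
from collections import Counter

def feedback(target, guess):
    """Return per-position feedback: 'g' (green), 'y' (yellow), '-' (grey)."""
    result = ['-'] * 5
    remaining = Counter()
    # first pass: exact matches; count the unmatched target letters
    for i, (t, g) in enumerate(zip(target, guess)):
        if t == g:
            result[i] = 'g'
        else:
            remaining[t] += 1
    # second pass: misplaced letters, consuming each target letter at most once
    for i, g in enumerate(guess):
        if result[i] != 'g' and remaining[g] > 0:
            result[i] = 'y'
            remaining[g] -= 1
    return ''.join(result)

print(feedback("taker", "talks"))  # "gg-y-", matching the example above
```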
&lt;h2 id="coming-up-with-a-solution">Coming up with a solution&lt;/h2>
&lt;p>At first, I tried to apply some concepts I had been studying as part of a reinforcement
learning class I&amp;rsquo;d been taking online. It&amp;rsquo;s possible that the formulation I came up
with just wasn&amp;rsquo;t a good one, but a simple approach without any fancy AI turned out
to actually work very well. I&amp;rsquo;ve learned, both through my job and some independent
study, that conceptual simplicity is often underrated.&lt;/p>
&lt;p>My general approach was simple. After collecting the body of words that Wordle
uses (which can actually be obtained pretty easily by inspecting the page source
of the game&amp;rsquo;s webpage), I thought through what the skeleton of an algorithm would
look like, and I came up with this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>guesses_made &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>set current guess to an initial guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">while&lt;/span> (guesses_made &lt;span style="color:#f92672">&amp;lt;&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span>) &lt;span style="color:#f92672">and&lt;/span> (current_guess &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> the target):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> get feedback on current guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> use feedback to reduce the set of valid words
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> make another guess
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> increment guesses_made
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>One way of looking at this skeleton is that we begin the game with no constraints
on which words are and are not valid. As
we make guesses and see feedback, we gain additional information that
allows us to further and further constrain the set of available words until &amp;ndash; hopefully &amp;ndash;
we&amp;rsquo;ve narrowed it all the way down and we&amp;rsquo;re certain of the answer.&lt;/p>
&lt;p>As described just above, the algorithm skeleton is missing a few important details,
namely:&lt;/p>
&lt;ul>
&lt;li>How does the feedback allow us to determine the pool of valid words we can choose from?&lt;/li>
&lt;li>How do we make our next guess given a set of guessable words?&lt;/li>
&lt;/ul>
&lt;p>The implementation choices we make to answer those two questions ultimately lead
to different algorithms. In this post, we discuss
some quick-and-dirty, very simple choices that turn out to perform well, but
I&amp;rsquo;d encourage you to come up with interesting alternatives on your own to see if you
can come up with something even better!&lt;/p>
&lt;h3 id="using-the-feedback">Using the feedback&lt;/h3>
&lt;p>Using the feedback requires specifying what kinds of words each type of feedback allows
us to eliminate.&lt;/p>
&lt;p>When we receive grey feedback, we know to eliminate all words that contain the grey letter.&lt;/p>
&lt;p>When we receive green feedback, at position 2, say, we know to eliminate all words
that do not contain the green letter at position 2.&lt;/p>
&lt;p>When we receive yellow feedback, there are two kinds of elimination we can perform. If
we get yellow feedback in position 3, then we know that the letter in our guess at position
3 cannot be in the target word at position 3, so we can eliminate all words whose letter at position
3 matches our guess&amp;rsquo;s. We can also eliminate any words that do not contain the yellow
letter, as we know it must be in the target somewhere.&lt;/p>
&lt;p>Finally, there is the problem of words with the same letter repeated multiple times.
Thinking things through a little bit, we realize that the number of yellow and green
copies of a given letter is a lower bound on the number of copies of that letter that
must be in the target word. For example, if we have a yellow &amp;ldquo;t&amp;rdquo; and a green &amp;ldquo;t&amp;rdquo; in
the feedback, we know that the target word must have at least 2 &amp;ldquo;t&amp;quot;s, so we can eliminate
all words with 0 or 1 &amp;ldquo;t&amp;quot;s.&lt;/p>
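&lt;p>The elimination rules above, including the repeated-letter logic, can be sketched as a single consistency check (the &lt;code>'g'&lt;/code>/&lt;code>'y'&lt;/code>/&lt;code>'-'&lt;/code> feedback encoding is a hypothetical choice of mine):&lt;/p>

```python
from collections import Counter

def consistent(word, guess, fb):
    # greens and yellows give a lower bound on copies of each letter;
    # a grey on a letter means the word has no copies beyond that bound
    required = Counter(g for g, f in zip(guess, fb) if f in "gy")
    counts = Counter(word)
    for i, (g, f) in enumerate(zip(guess, fb)):
        if f == "g" and word[i] != g:
            return False  # green: letter must match at that position
        if f == "y" and (word[i] == g or counts[g] < required[g]):
            return False  # yellow: letter must appear, but not here
        if f == "-" and counts[g] > required[g]:
            return False  # grey: no extra copies of this letter
    return True

def eliminate(words, guess, fb):
    return [w for w in words if consistent(w, guess, fb)]
```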
&lt;h3 id="making-the-next-guess">Making the next guess&lt;/h3>
&lt;p>Each time we use feedback to cull the set of valid words, we have to then choose from
a potentially vast set of remaining words. In order to do this, we have to come up
with some heuristic to narrow the field.&lt;/p>
&lt;p>In my case, I chose to give each word a score and then chose the word with the highest score (breaking ties randomly if required).
To compute the score, I first came up with the distribution of letters in each position.
For example, at position 1, maybe &amp;ldquo;s&amp;rdquo; was the most common letter, making up 6% of the letters
found in position 1 across the set of possible words.
If the word under evaluation contains an &amp;ldquo;s&amp;rdquo; at position 1, the word
would accrue a credit of 0.06 for the &amp;ldquo;s&amp;rdquo;. The sum of these credits across the 5 positions
determines the word score. At each point, I select the valid word with the highest score.&lt;/p>
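&lt;p>A sketch of this positional scoring (function and variable names are just illustrative):&lt;/p>

```python
from collections import Counter

def positional_frequencies(words):
    # for each of the 5 positions, the fraction of words with each letter there
    freqs = []
    for i in range(5):
        counts = Counter(w[i] for w in words)
        total = sum(counts.values())
        freqs.append({c: n / total for c, n in counts.items()})
    return freqs

def score(word, freqs):
    # credit each letter by how common it is in its position
    return sum(freqs[i].get(c, 0.0) for i, c in enumerate(word))

def best_guess(words):
    freqs = positional_frequencies(words)
    return max(words, key=lambda w: score(w, freqs))
```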
&lt;p>This scoring system has (at least) one obvious weakness! If the target word has letters in certain positions that are very
uncommon for that position, the algorithm will pick other words first and possibly run out of guesses. Trying to figure out
how to remedy this would make the algorithm more robust, but I haven&amp;rsquo;t given it enough thought as of this writing.&lt;/p>
&lt;h2 id="how-well-does-the-solver-work">How well does the solver work?&lt;/h2>
&lt;p>With an allowance of 6 guesses, on a random sample of 5k target words, my solver successfully found the target word about 90% of the time
in an average of 5 guesses.
Increasing the allowance to 9 guesses, it succeeds 98.5% of the time. With 15 guesses, it succeeds on all 5k examples. In the instances
where it fails with 6 guesses, there are, on average, about 7 valid choices left. That&amp;rsquo;s pretty good!&lt;/p>
&lt;h2 id="what-is-the-best-word-to-start-with">What is the best word to start with?&lt;/h2>
&lt;p>Before trying to answer this question, it should be noted that &amp;ldquo;best&amp;rdquo; in this context
depends on your algorithm. Different scoring methodologies, for example, would imply
a different ordering on the quality of initial guesses. The results below are obtained
using the algorithm we just described; if you vary the algorithm, you might find something
different.&lt;/p>
&lt;p>For each possible initial guess &amp;ndash; around 13k of them &amp;ndash; I chose 200 random target words.
(I could have chosen more than 200, but I was constrained by computation time.)
For each guess, we record the fraction of the 200 problems that were solved successfully and assign
that as a score for that initial guess.&lt;/p>
&lt;p>Some of the words that achieved the top 5% of scores were&lt;/p>
&lt;ul>
&lt;li>chapt&lt;/li>
&lt;li>chimp&lt;/li>
&lt;li>germs&lt;/li>
&lt;li>compt&lt;/li>
&lt;li>match&lt;/li>
&lt;li>chems&lt;/li>
&lt;li>frump&lt;/li>
&lt;li>bumph&lt;/li>
&lt;li>spick&lt;/li>
&lt;li>crumb&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>and some of the words in the bottom 5% were&lt;/p>
&lt;ul>
&lt;li>nanna&lt;/li>
&lt;li>zooea&lt;/li>
&lt;li>gazoo&lt;/li>
&lt;li>zexes&lt;/li>
&lt;li>vairy&lt;/li>
&lt;li>roque&lt;/li>
&lt;li>navvy&lt;/li>
&lt;li>ninon&lt;/li>
&lt;li>ozzie&lt;/li>
&lt;li>nouny&lt;/li>
&lt;li>&amp;hellip;&lt;/li>
&lt;/ul>
&lt;p>(there were others, but I don&amp;rsquo;t show them here for brevity). None of the words in the top
5% of scorers repeated letters, while 90% of those in the bottom 5% did, suggesting
that when picking your first word, it would be unwise to pick a word with repeated letters.&lt;/p>
&lt;p>I also looked at how much of the best initial guesses were made up of the five most common
(s, e, a, o, and r) and five least common (v, z, j, x, and q) letters of the alphabet.
(Here, most and least common are relative to the 12k Wordle words.) At first, the results
seem a bit counterintuitive &amp;ndash; 51% of the letters in the worst 5% of first guesses are
made of the five most common letters of the alphabet, while only 28% of those in the best 5% are!
What gives!?&lt;/p>
&lt;p>One hypothesis is that feedback on common
letters may help significantly narrow the field, but repeating them doesn&amp;rsquo;t provide much additional
information, so it&amp;rsquo;s worth diversifying. But then, you might ask, why aren&amp;rsquo;t there words in there
with repeated infrequent letters? I imagine that this is probably because there just aren&amp;rsquo;t that many
words to begin with that have multiple qs, js, vs, xs, or zs. The better scorers use common
letters, but avoid repeating them, so the fraction of common letters is smaller in that set.
If we look at the fraction of letters in high and low scorers that come from the five least common letters,
what we see is striking: the high scorers do not contain &lt;em>any&lt;/em> of the five least
frequent letters, while 14% of the letters in the low scorers come from that set.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>I hope you found this exercise as fun and interesting as I did. I&amp;rsquo;ll also be posting the code
I used soon, so feel free to have a look at it if you&amp;rsquo;re interested
in seeing what the actual implementation looks like.
Even though the insights we arrived at were pretty intuitive, I hope that
you enjoyed putting a little bit of rigor to it. Happy Wordling!&lt;/p>
&lt;p>Edit (2022-01-17): I&amp;rsquo;ve posted the code &lt;a href="https://github.com/gindij/wordle">here&lt;/a>. In response to some feedback I received about the post,
I also changed the word-scoring algorithm to encourage helpful
exploration, rather than only using words that satisfy the constraints
we&amp;rsquo;ve accumulated information about.&lt;/p></description></item><item><title>Linear interpolation in one and two dimensions</title><link>https://www.jgindi.me/posts/2021-09-01-interp/</link><pubDate>Thu, 02 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-09-01-interp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post, I want to demonstrate how helpful visual intuition can
be. To do this, we are going to
think about how to extend a technique called linear interpolation from one
dimension to two. Loosely speaking, techniques for interpolation allow us to
use information that we know to hopefully make reasonable estimates of
quantities we don&amp;rsquo;t know. In the rest of this post, we will
first discuss linear interpolation in one dimension, and then use some pictures to
figure out what it would mean to linearly interpolate in two dimensions.&lt;/p>
&lt;h2 id="linear-interpolation-1d">Linear Interpolation: 1D&lt;/h2>
&lt;p>Suppose that you have two points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ and a value
$x_1 \leq x \leq x_2$ whose corresponding value $f(x)$ we want to estimate. The
first thing you might think to do is to assume that $f$ is linear.
You would then find the slope $m$ and intercept $b$ of
the line connecting the points $(x_1, f(x_1))$ and $(x_2, f(x_2))$, and then use
that line to estimate that $f(x) = mx + b$. This is shown visually in the figure
below.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-1d.png" alt="drawing" width="400" height="300"/>
&lt;p>It turns out that by rearranging the expression $f(x) = mx + b$ (with $m$ and $b$
expanded as shown in the figure), we can actually express $f(x)$ in a different way:
&lt;/p>
$$
f(x) = \theta f(x_2) + (1 - \theta)f(x_1),
$$&lt;p>
where $\theta = \frac{x - x_1}{x_2 - x_1}$ is the fraction of the total distance
between $x_1$ and $x_2$ that is between $x$ and $x_1$. This formulation furnishes
another way to think about what linear interpolation does: it estimates $f(x)$ by
mixing some amount of $f(x_1)$ with some amount of $f(x_2)$. The amounts of each
that are used depend on how close to $x_1$ (or $x_2$) $x$ lies. (To be precise,
how much of $f(x_1)$ we use actually depends on the size of the distance between $x$ and
$x_2$. As $x$ moves further from $x_2$, the coefficient on $f(x_1)$ should get bigger.)&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-1d-mix.png" alt="drawing" width="400" height="200"/>
&lt;h2 id="linear-interpolation-2d">Linear interpolation: 2D&lt;/h2>
&lt;p>Now suppose that instead of $(x, f(x))$ pairs, we have $((x, y), f(x, y))$ pairs.
Whereas in the prior section, the domain of $f$ is the set of real numbers, in this
section, the domain is actually points in the plane. The setting for interpolation
in two dimensions is that we have four points in the plane $(x_1, y_1)$, $(x_1, y_2)$,
$(x_2, y_1)$, and $(x_2, y_2)$ whose $f$ values we know. We are then given another
point, $(x, y)$, and we are trying to estimate the value of $f(x, y)$ (again assuming
that $f$ is linear). This setup is shown graphically below.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-surface.png" alt="drawing" width="300" height="300"/>
&lt;p>In this scenario, $f$ actually defines a surface (shown in gray), rather than a curve.
In order to estimate the value of $f(x,y)$, we want to come up with a formula
for linear interpolation in two dimensions. There are various ways to derive
the formula for this**, but here I want to discuss one that I think has an elegant and
very intuitive visual interpretation. It turns out that we can borrow the mixture idea
from the 1D case, but instead of a mixture based on distances along a line,
we are going to use areas of subrectangles.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-flat.png" alt="drawing" width="400" height="300"/>
&lt;p>The key here is that we are using areas as a proxy for 2D &amp;ldquo;distance&amp;rdquo;. To sanity check this intuition,
note that if $(x, y)$ is one of our four known points, say $(x_1, y_1)$, the area of
the subrectangle corresponding to it will equal the total area of the larger rectangle,
and the other three subrectangles will have zero area. Using this method, we can easily see that in this case,
&lt;/p>
$$
f(x, y) = \frac{(x_2 - x_1)(y_2 - y_1)}{(x_2 - x_1)(y_2 - y_1)} \cdot f(x_1, y_1) + 0\cdot f(x_2, y_1) + 0 \cdot f(x_1, y_2) + 0 \cdot f(x_2, y_2) = f(x_1, y_1),
$$&lt;p>
as we expect.&lt;/p>
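&lt;p>As in the 1D case, the area-weighted mixture is only a few lines of code. A minimal sketch (the function name and argument ordering are my own):&lt;/p>

```python
def bilerp(x1, x2, y1, y2, f11, f21, f12, f22, x, y):
    # f11 = f(x1, y1), f21 = f(x2, y1), f12 = f(x1, y2), f22 = f(x2, y2)
    total = (x2 - x1) * (y2 - y1)
    # each corner's weight is the area of the subrectangle opposite it,
    # as a fraction of the total area
    w11 = (x2 - x) * (y2 - y) / total
    w21 = (x - x1) * (y2 - y) / total
    w12 = (x2 - x) * (y - y1) / total
    w22 = (x - x1) * (y - y1) / total
    return w11 * f11 + w21 * f21 + w12 * f12 + w22 * f22

# At a known corner, all of the weight falls on that corner's value:
print(bilerp(0, 1, 0, 1, 5.0, 7.0, 6.0, 8.0, 0, 0))  # 5.0
```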
&lt;p>This intuition can be extended to an arbitrary number of dimensions. In 3D, for
instance, we would use volumes of sub-rectangular boxes rather than areas of subrectangles.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This isn&amp;rsquo;t an especially deep idea from a mathematical standpoint, but I thought that
it was a nice illustration of how sometimes, visual intuition can take us a very long way.
If you&amp;rsquo;re ever trying to solve some challenging problem and you don&amp;rsquo;t know where to start,
drawing some pictures might be a great way to get the juices flowing.&lt;/p>
&lt;br>
&lt;p>**The usual, and sort of messy, way to derive the formula for bilinear interpolation
is to first interpolate in one of the variables and then the other. Just to give a sense
for how this gets cumbersome, we will briefly show how to do it.
First we compute $f(x_1, y)$ and $f(x_2, y)$
&lt;/p>
$$
\begin{align}
f(x_1,y) &amp;= \frac{y-y_1}{y_2 - y_1}f(x_1, y_2) + \frac{y_2 - y}{y_2 - y_1}f(x_1, y_1)\\
f(x_2,y) &amp;= \frac{y-y_1}{y_2 - y_1}f(x_2, y_2) + \frac{y_2 - y}{y_2 - y_1}f(x_2, y_1)
\end{align}
$$&lt;p>
Then we compute $f(x,y) = \frac{x-x_1}{x_2 - x_1} f(x_2, y) + \frac{x_2-x}{x_2 - x_1}f(x_1, y)$,
plugging in (1) and (2). In my opinion, this approach quickly becomes unwieldy in more dimensions,
and the intuition becomes less clear the more dimensions you try to think about.&lt;/p></description></item><item><title>Probabilistic interpretation of regularization</title><link>https://www.jgindi.me/posts/2021-05-09-regularization/</link><pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-05-09-regularization/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>If you&amp;rsquo;ve read enough of my posts over the years, you know that some of my favorite
topics to write about are those that can be thought about or studied from
different perspectives. In this post, I want to write about regularization, a
technique used in machine learning to mitigate a common problem called overfitting
&amp;ndash; a problem that crops up when an algorithm fits its understanding of the world
so tightly to a particular dataset that it isn&amp;rsquo;t able to make predictions about
data it hasn&amp;rsquo;t seen. Regularization can be thought of as a term added
to the optimization objective that directly discourages overfitting, or it can be
thought of in an interesting statistical way.&lt;/p>
&lt;h2 id="a-helpful-example-simple-linear-regression">A helpful example: simple linear regression&lt;/h2>
&lt;p>Let&amp;rsquo;s say that we&amp;rsquo;re building a linear regression model of house prices. That is, if
$b$ represents the number of bedrooms in a house, we &lt;em>assume&lt;/em> (more on this in a second)
that the relationship between the number of bedrooms and the price of the house
is linear:
&lt;/p>
$$
p(b) = \theta b + \epsilon.
$$&lt;p>
Here, $p$ is price, $b$ is the number of bedrooms in the house, and $\epsilon$
is a random number that represents the error in our model, or the part of our model that
the data we are using do not explain. What the formula above says is that &lt;em>we believe&lt;/em>
that we can model the relationship between #bedrooms and price with a linear model.
This &lt;em>does not&lt;/em> say that we believe that the actual relationship is linear. This is a
very important distinction. We believe that the linear model might be &lt;em>useful&lt;/em>, not
necessarily that it is &lt;em>correct&lt;/em> or &lt;em>true&lt;/em>.&lt;/p>
&lt;p>(Another way of thinking about our equation, or model, is that it says once we know
the parameter $\theta$ and the particular number of bedrooms $b_0$, the randomness
has been confined to the variation of prices around a known mean: $\theta b_0$.)&lt;/p>
&lt;h2 id="fitting-the-parameters">Fitting the parameters&lt;/h2>
&lt;p>One natural way to find the best parameter $\theta$ for a set of data is to find
the value of $\theta$ that literally best fits the input data! To better understand what
this means, let&amp;rsquo;s suppose that we have (#bedroom, price) pairs $(b_i, y_i)$ for $i=1,\dots,100$,
and a current guess at a parameter $\theta$. A natural measure of how
well $\theta$ fits is the average (squared) error, where the error for each example
is the difference between $\theta b_i$ (our prediction) and $y_i$
(the actual price). Mathematically, we can write this measure down as
&lt;/p>
$$
J(\theta) = \frac{1}{100} \sum_{i=1}^{100} (\theta b_i - y_i)^2.
$$&lt;p>
Now that we&amp;rsquo;ve decided what constitutes a good choice of parameter, we can employ
tools provided by calculus to actually calculate what value of $\theta$ is best
by solving the optimization problem (replacing 100 with the more general $m$) given by
&lt;/p>
$$
\text{argmin}_\theta \sum_{i=1}^m (\theta b_i - y_i)^2.
$$&lt;p>
(This is the value of $\theta$ that minimizes $J$. Without going into detail,
in this case, it turns out that the best value is $\theta = \frac{\mathbf y^T \mathbf b}{\mathbf b^T \mathbf b}$,
where $\mathbf b = (b_1,\dots, b_m)$ and $\mathbf y = (y_1,\dots,y_m)$.)&lt;/p>
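&lt;p>Just to see the closed-form solution in action, here is a quick check on some made-up (#bedrooms, price) data (the numbers are invented for illustration):&lt;/p>

```python
import numpy as np

# Made-up (#bedrooms, price) pairs, with price roughly linear in bedrooms
b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([110.0, 205.0, 290.0, 410.0, 495.0])

# Closed-form least-squares solution: theta = (y^T b) / (b^T b)
theta = (y @ b) / (b @ b)

# At the minimizer, the derivative of J vanishes: sum_i b_i (theta*b_i - y_i) = 0
print(theta, np.sum(b * (theta * b - y)))
```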
&lt;h2 id="regularization">Regularization&lt;/h2>
&lt;p>One important concern when fitting a machine learning model is whether or not
your model is too tightly fit to the data that you have. Because models are fit using
a finite sample of data, it is possible, even likely, that your data is not representative
of what can occur &amp;ldquo;in the wild.&amp;rdquo; As such, the model you&amp;rsquo;ve built may be terrific on the data
it used to train, but does not actually generalize to situations it hasn&amp;rsquo;t encountered.
There are various techniques for combating this problem, but the one we will discuss
here is one called regularization.&lt;/p>
&lt;h3 id="the-intuitive-motivation">The intuitive motivation&lt;/h3>
&lt;p>In models with more than one feature, overfitting tends to occur because certain
of the features have parameters that are too large, i.e., that their impact is overstated
in the model. As such, rather than just finding the parameters that minimize the least
squares objective, we want to find &lt;em>small&lt;/em> parameters that minimize the objective.
For the simple regression case, we would add a term to the objective:
&lt;/p>
$$
\text{argmin}_\theta ~~ \sum_{y_i, b_i}(\theta b_i - y_i)^2 + \frac{\lambda}{2} \theta^2
$$&lt;p>Intuitively, if $\theta$ is large, the objective value that we are trying to minimize
will also be large, so the optimizer will not be encouraged to pick that value of $\theta$,
even if it fits the data pretty well. Adding this term causes the optimizer to trade off
goodness of fit and simplicity (in the sense of parameters that aren&amp;rsquo;t too large).
The constant $\lambda$ controls our preferences with respect to that tradeoff: larger
values of $\lambda$ will encourage smaller values of $\theta$.&lt;/p>
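&lt;p>For this one-parameter problem, the regularized objective can still be solved in closed form. A small sketch (the data are made up; one can verify the formula by setting the derivative of the objective to zero):&lt;/p>

```python
import numpy as np

# Made-up (#bedrooms, price) data
b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([110.0, 205.0, 290.0, 410.0, 495.0])

def ridge_theta(lam):
    # Setting the derivative of sum_i (theta*b_i - y_i)^2 + (lam/2)*theta^2
    # to zero gives theta = (y^T b) / (b^T b + lam/2)
    return (y @ b) / (b @ b + lam / 2)

# Larger lambda shrinks theta toward zero:
print(ridge_theta(0.0), ridge_theta(10.0), ridge_theta(100.0))
```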
&lt;h3 id="statistical-interpretation">Statistical interpretation&lt;/h3>
&lt;p>While the intuitive motivation is usually enough, there is a cool
statistical interpretation of what is going on here that I think is worth pointing out.
If we instead think of finding $\theta$ by carrying out &lt;a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum (log) likelihood estimation&lt;/a> (MLE),
then regularization naturally arises when we add the additional assumption to our
model that $\theta$ comes from a normal distribution centered around zero with variance
$1/\lambda$ (we can tune $\lambda$ to change the width of the bell curve). Making this assumption
essentially pins down a probability density function for the parameter $\theta$:
$P(\theta) = \frac{\sqrt{\lambda}}{\sqrt{2\pi}}\exp(-\lambda(\theta - 0)^2/2)$.
Taking $\log$s (this doesn&amp;rsquo;t affect the optimization problem we need to solve), we have
&lt;/p>
$$
\log P(\theta) = \log\biggr( \frac{\sqrt{\lambda}}{\sqrt{2\pi}} \biggr) - \frac{\lambda}{2} \theta^2.
$$&lt;p>
Adding this assumption about the prior distribution over $\theta$ and ignoring
constants (with respect to $\theta$), we would need to solve the modified problem
(often called maximum a posteriori, or MAP, estimation):
&lt;/p>
$$
\begin{align}
\text{argmax}_\theta~ \log(P(\mathbf y ~|~ \mathbf b, \theta)P(\theta))
&amp;= \text{argmax}_\theta ~ -\sum_{y_i, b_i}(\theta b_i - y_i)^2 - \frac{\lambda}{2} \theta^2\\
&amp;= \text{argmin}_\theta ~ \sum_{y_i, b_i}(\theta b_i - y_i)^2 + \frac{\lambda}{2} \theta^2
\end{align}
$$&lt;p>
which is exactly what we had intuitively motivated in the previous section!&lt;/p>
&lt;p>We&amp;rsquo;ve just uncovered the statistical interpretation of regularization!* Using (this flavor
of) regularization is actually imposing a Gaussian prior onto $\theta$. As we force the
width of $\theta$&amp;rsquo;s bell curve to become smaller by increasing $\lambda$, we are encoding the
fact that larger values of $\theta$ are less likely and should therefore be penalized more
heavily during the optimization process.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post, we encountered a cool technique that underlies many statistical models called maximum likelihood estimation (MLE), and showed that a common technique used to combat overfitting actually has a nice statistical interpretation, too!&lt;/p>
&lt;p>Happy Mother&amp;rsquo;s Day to all!&lt;/p>
&lt;p>*The regularization we discuss here is called L2 regularization. Regularization comes in
other forms too. The most popular other choice is called L1 regularization, and it can
actually be interpreted as imposing a Laplacian (rather than a Gaussian) prior.&lt;/p></description></item><item><title>Solving sudoku as a linear program</title><link>https://www.jgindi.me/posts/2021-04-11-sudoku-lp/</link><pubDate>Sun, 11 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-04-11-sudoku-lp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>During the last several months, my wife and I went through a Kenken phase. For
those not familiar with Kenken, check out this &lt;a href="https://www.kenkenpuzzle.com">website&lt;/a>.
Kenken is a great example of a class of problems that are more broadly categorized as
constraint satisfaction problems (CSPs). I wrote some code that generates and solves Kenkens,
which I&amp;rsquo;d like to write about when I have more time, but in this post I want to talk about
an interesting, nonstandard way to set up and solve a different, far more popular CSP: sudoku!&lt;/p>
&lt;h2 id="what-is-sudoku">What is sudoku?&lt;/h2>
&lt;p>For those who are not familiar with sudoku puzzles, they are designed as follows.
The initial board is a 9x9, partially filled in grid. The goal is for the solver
to find (the unique) values for the empty squares so that the board satisfies the following
rules:&lt;/p>
&lt;ol>
&lt;li>There can be no repeated values in any row. (Each row must contain each value
from 1 up to 9 exactly once.)&lt;/li>
&lt;li>There can be no repeated values in any column. (Each column must contain each
value from 1 up to 9 exactly once.)&lt;/li>
&lt;li>There can be no repeated values in each of the 9 3x3 groups of cells outlined
in a bolded black line in the figures below.&lt;/li>
&lt;/ol>
&lt;div align='center'>
&lt;img src="https://www.jgindi.me/posts/sudoku-lp/initial.png" alt="drawing" width="200" height="200"/>
&lt;img src="https://www.jgindi.me/posts/sudoku-lp/solved.png" alt="drawing" width="200" height="200"/>
&lt;p>An unfilled board (left) and its solution (right).&lt;/p>
&lt;/div>
&lt;h2 id="modeling-the-problem">Modeling the problem&lt;/h2>
&lt;p>In order to formulate this as an optimization problem we need to come up with a
variable to optimize and constraints that tell the optimization algorithm what kinds
of solutions are valid. In any sudoku problem, there are 81 cells that need to
be filled in (including those that are already populated at the start). Each of those
cells has 9 possible values, so we will index our variable $x$ with 3 indices: the first
will indicate the row, the second will indicate the column, and the third will refer to a
value between 1 and 9. Furthermore, we will require each $x_{ijk}$ to take either the
value 0 or the value 1: $x_{ijk} = 1$ if the cell at position $(i, j)$ on the board
has the value $k$ and 0 otherwise.&lt;/p>
&lt;p>This way of modeling a sudoku board is not the most intuitive one possible, but it
will make formulating our constraints easier, which is the topic we turn to next. (If
the above wasn&amp;rsquo;t clear, give it another read before moving on.)&lt;/p>
&lt;h2 id="constraints">Constraints&lt;/h2>
&lt;p>There are a few types of constraints we need to respect to make sure that our
optimization algorithm comes up with a valid solution to our sudoku problems. We
will discuss each type of constraint in turn.&lt;/p>
&lt;h3 id="respecting-given-values">Respecting given values&lt;/h3>
&lt;p>The first constraint the solution needs to satisfy is that the values that are
already provided must be respected. That is, if the first cell is already
filled with the value 5, the optimizer is not allowed to change that. By the way
we defined $x$ earlier, this would correspond to the constraint $x_{115} = 1$. Similarly,
if the lower right value were set to 1, we would have another constraint corresponding
to this value given by $x_{991} = 1$. We add a constraint like this for every value
that is provided on the initial board.&lt;/p>
&lt;h3 id="each-cell-contains-a-single-value">Each cell contains a single value&lt;/h3>
&lt;p>This is a relatively simple constraint, but an important one. To model this constraint
for cell $(1, 1)$, we would require that $\sum_{k=1}^9 x_{11k} = 1$. Because each
entry of $x$ is either 0 or 1, this constraint says that exactly one of the entries
corresponding to the first cell must be set. We add a constraint like the one I described
for the first cell for each of the 81 cells.&lt;/p>
&lt;h3 id="row-column-and-box-constraints">Row, column, and box constraints&lt;/h3>
&lt;p>We require that each row contain each digit from 1 to 9. To encode that the digits
in row $i$ must be unique, we need to make sure that for each value $k$, we have
$\sum_{j=1}^9 x_{ijk} = 1$ (in the sum, $i$ and $k$ are fixed). There are 81 such
constraints corresponding to the rows, another 81 can be analogously formulated
for the columns, and another 81 can be formulated to model the box constraints.&lt;/p>
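&lt;p>To make the row constraints concrete, here is a sketch of how the corresponding rows of the constraint matrix might be built (the 0-based flattening convention here is my own choice for illustration, not necessarily the one used in the linked code):&lt;/p>

```python
import numpy as np

# Flatten x so that cell (i, j) with candidate value k maps to 81*i + 9*j + k
# (i, j, k run from 0 to 8 here, versus the 1-based indexing in the text).
def idx(i, j, k):
    return 81 * i + 9 * j + k

row_constraints = []
for i in range(9):          # for each row of the board...
    for k in range(9):      # ...and each candidate value...
        a = np.zeros(729)
        for j in range(9):  # ...k must appear in exactly one column j
            a[idx(i, j, k)] = 1.0
        row_constraints.append(a)

A_rows = np.vstack(row_constraints)
print(A_rows.shape)  # (81, 729)
```

The column and box constraints follow the same pattern with the roles of the indices changed.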
&lt;h3 id="consolidating-constraints">Consolidating constraints&lt;/h3>
&lt;p>Once we&amp;rsquo;ve modeled all the constraints, we can stack them into a single matrix
equality constraint $Ax = b$. Although we&amp;rsquo;ve been discussing $x$ as though
it were three dimensional, when we pass the optimization problem to the computer, we
flatten it into one long 729-vector (9 rows $\times$ 9 columns $\times$ 9
candidate values per cell). Each row of $A$ corresponds to a single constraint.
Coefficients corresponding to the variables that are active in that constraint
are set to 1 and the other coefficients in that row are all set to 0. Because
all of our constraints have 1s on their right hand sides, we have $b = \mathbf 1$.&lt;/p>
&lt;h2 id="formulating-and-solving-the-problem">Formulating and solving the problem&lt;/h2>
&lt;p>We can now formulate our optimization problem:
&lt;/p>
$$
\begin{align*}
&amp;\underset{x \in \{0, 1\}^{729}}{\text{minimize}} ~~ 0\\
&amp;\text{subject to} ~~ Ax = \mathbf 1.
\end{align*}
$$&lt;p>
Because sudokus have unique solutions, we just need to find the $x$ that satisfies
our constraints &amp;ndash; there will only be one such $x$! Because we just need to find that
$x$, the objective value doesn&amp;rsquo;t have to help us discriminate between competing feasible
points for this problem, so we can safely use 0 as our objective function. (This type
of problem is known as a feasibility problem.)*&lt;/p>
&lt;p>With the optimization problem in hand, the following short piece of code
uses &lt;a href="https://www.cvxpy.org">CVXPY&lt;/a> to solve any sudoku puzzle very quickly:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> cvxpy &lt;span style="color:#66d9ef">as&lt;/span> cp
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... &amp;lt;code that formulates the constraints&amp;gt; ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>x &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Variable(N, boolean&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array(list(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>b &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(len(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>objective &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Minimize(&lt;span style="color:#ae81ff">0&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>constraints &lt;span style="color:#f92672">=&lt;/span> [A &lt;span style="color:#f92672">@&lt;/span> x &lt;span style="color:#f92672">==&lt;/span> b]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Problem(objective, constraints)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem&lt;span style="color:#f92672">.&lt;/span>solve(solver&lt;span style="color:#f92672">=&lt;/span>cp&lt;span style="color:#f92672">.&lt;/span>ECOS_BB)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The full code (~100 lines) can be found &lt;a href="https://github.com/gindij/SudokuLP">here&lt;/a>.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Sudoku is typically framed and solved as a CSP with
algorithms that involve some guessing and checking. I thought this was an interesting
application of optimization to a problem that maybe doesn&amp;rsquo;t immediately lend itself
to such a formulation. Hope you enjoyed!&lt;/p>
&lt;p>*As a note for a slightly more technically inclined reader, a friend of mine
pointed out that the algorithm that the solver uses, called &lt;a href="https://web.stanford.edu/class/ee364b/lectures/bb_slides.pdf">branch and bound&lt;/a>,
ends up degenerating into an exponential tree search, akin to just trying out
all possibilities with no way of discriminating between &amp;ldquo;better&amp;rdquo; and &amp;ldquo;worse&amp;rdquo;
points. To remedy this (at least in part), we can use the objective function
$x^T \mathbf 1$, which counts the number of ones in any solution. By using
this objective instead of $0$, we give the algorithm a way to &amp;ldquo;prune&amp;rdquo; the tree it
is searching, which may lead to performance benefits. The code implementing this
approach would only be slightly different:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> cvxpy &lt;span style="color:#66d9ef">as&lt;/span> cp
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... &amp;lt;code that formulates the constraints&amp;gt; ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>x &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Variable(N, boolean&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>A &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array(list(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>b &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(len(constrs))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>objective &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Minimize(x&lt;span style="color:#f92672">.&lt;/span>T &lt;span style="color:#f92672">@&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>ones(N)) &lt;span style="color:#75715e"># &amp;lt;-- new objective function&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>constraints &lt;span style="color:#f92672">=&lt;/span> [A &lt;span style="color:#f92672">@&lt;/span> x &lt;span style="color:#f92672">==&lt;/span> b]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem &lt;span style="color:#f92672">=&lt;/span> cp&lt;span style="color:#f92672">.&lt;/span>Problem(objective, constraints)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>problem&lt;span style="color:#f92672">.&lt;/span>solve(solver&lt;span style="color:#f92672">=&lt;/span>cp&lt;span style="color:#f92672">.&lt;/span>ECOS_BB)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>My first academic publication!</title><link>https://www.jgindi.me/posts/2021-03-10-first-pub/</link><pubDate>Thu, 01 Apr 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-03-10-first-pub/</guid><description>&lt;p>I wanted to share that I&amp;rsquo;ve been fortunate enough to publish my first bit of
academic research that I collaborated on with &lt;a href="https://www.nicholasmoehle.com/">Nick Moehle&lt;/a>,
Prof. &lt;a href="https://web.stanford.edu/~boyd/">Stephen Boyd&lt;/a>,
and Prof. &lt;a href="https://mykel.kochenderfer.com/">Mykel Kochenderfer&lt;/a>! When people ask me
what I wish I could have done differently in college, I often lament that I didn&amp;rsquo;t
take the opportunity to get involved with more research. I&amp;rsquo;m really grateful
that I&amp;rsquo;ve had the opportunity to participate in writing this paper, and hope to
collaborate on a few more down the road!&lt;/p>
&lt;p>You can find the paper &lt;a href="https://arxiv.org/abs/2103.05455">here&lt;/a>, and
an open-sourced implementation of some of the algorithms we talk about
&lt;a href="https://github.com/blackrock/lcso">here&lt;/a>. Happy to discuss if you&amp;rsquo;re interested!&lt;/p></description></item><item><title>Finding eigenvalues</title><link>https://www.jgindi.me/posts/2021-03-08-finding-eigvals/</link><pubDate>Mon, 08 Mar 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-03-08-finding-eigvals/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Over the past few months, I&amp;rsquo;ve been working on some optimization-related projects
at work. Making optimization algorithms efficient and effective often comes down to command of
numerical linear algebra, otherwise known as the intersection of linear algebra and computers.
It is one thing to discover an algorithm for certain problems that works well in
the ether. It is another entirely to ensure that the algorithm works well once
it violently collides with the physics of finite precision computers. As someone
who has come to deeply appreciate the power of mixing elegance and implementation,
I decided to delve more deeply into the subject by making my way through
&lt;em>Numerical Linear Algebra&lt;/em> by Trefethen and Bau.&lt;/p>
&lt;p>This post works through one of the chapters about developing an algorithm to find
the largest eigenvalue and its corresponding eigenvector of a symmetric positive
definite matrix $A$.&lt;/p>
&lt;h2 id="review-of-eigenvalues-and-eigenvectors">Review of eigenvalues and eigenvectors&lt;/h2>
&lt;p>Eigenvalues and eigenvectors are central in applied linear algebra. They have
applications across machine learning, communication systems, mechanical engineering,
optimization, and many other disciplines. One particularly important application
of eigenvalues to our everyday lives is search engines! In fact, Google&amp;rsquo;s PageRank
algorithm (or at least the initial algorithm), is all based on eigenthings. For a
great explanation of the original conception of using PageRank to organize the internet,
check out &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=1552E98D8F3A3CDBC9C7BD9E39591000?doi=10.1.1.38.5427&amp;amp;rep=rep1&amp;amp;type=pdf">the original paper&lt;/a>
by Sergey Brin and Larry Page. As my good friend Ben reminded me &amp;ldquo;Eigenvectors
power our internet!&amp;rdquo;&lt;/p>
&lt;p>In essence, an eigenvector $v$ of a matrix $A$ is a vector that is exclusively
stretched (not rotated) when acted upon by $A$. An eigenvalue $\lambda$ of $A$
that corresponds to $v$ is the stretch factor. Formally, $v$ is an eigenvector
of $A$, with corresponding eigenvalue $\lambda$ if we have
&lt;/p>
$$Av = \lambda v.$$&lt;p>
For the rest of the post, we will assume we&amp;rsquo;re dealing with a symmetric
positive definite matrix.&lt;/p>
&lt;h2 id="a-helpful-characterization">A helpful characterization&lt;/h2>
&lt;p>Our first step will be to develop a helpful characterization of eigenvalues. To do
this, given a matrix $A$ and a nonzero vector $x$, we consider the problem
&lt;/p>
$$\text{minimize}_\alpha ~~ \| Ax - \alpha x\|_2^2.$$&lt;p>
We are essentially looking for the scalar that best approximates an eigenvalue
corresponding to the vector $x$, &lt;em>i.e.&lt;/em> an $\alpha$ such
that $Ax \approx \alpha x$. We can easily solve this minimization problem by
setting the derivative of the objective (w.r.t. $\alpha$) to 0 and solving for $\alpha$.
Carrying this out, as a function of $x$, we get:
&lt;/p>
$$
\alpha(x) = \frac{x^TAx}{x^Tx}
$$&lt;p>What we are interested in are the critical points of $\alpha(x)$ as a function of $x$.
Using the vector analog of the quotient rule for taking derivatives, we have
&lt;/p>
$$
\begin{align*}
\nabla_x \alpha(x) &amp;= \frac{2Ax}{x^Tx} - \frac{(2x)(x^TAx)}{(x^Tx)^2} \\\\
&amp;= \frac{2}{x^Tx}\biggr(Ax - \biggr(\frac{x^TAx}{x^Tx}x\biggr)\biggr) \\\\
&amp;= \frac{2}{x^Tx}(Ax - \alpha(x)x).
\end{align*}
$$&lt;p>Suppose $v$ is a critical point of $\alpha$, &lt;em>i.e.&lt;/em> $\nabla \alpha(v) = 0$. For that $v$,
we would have $Av = \alpha(v)v$. That is, $v$ is an eigenvector of $A$, with eigenvalue
$\alpha(v)$. Conversely, if $v$ is an eigenvector of $A$ with corresponding
eigenvalue $\lambda$, then we have $\alpha(v) = \lambda \frac{v^Tv}{v^Tv} = \lambda$.
We&amp;rsquo;ve now shown that the vectors $v$ that make the derivative 0 are &lt;em>exactly&lt;/em> the
eigenvectors of $A$. For each of those eigenvectors, $\alpha$ produces the corresponding
eigenvalue.&lt;/p>
&lt;p>This characterization of eigenvalues and eigenvectors is important because it gives
us an iterative way to think about these mathematical objects with a definition that is more amenable
to computation. The function $\alpha$ is important enough that it has a name, the
&lt;em>Rayleigh quotient&lt;/em>, and it is crucial to our development of the algorithms below.&lt;/p>
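&lt;p>A quick numerical sanity check of this characterization (in Python with NumPy rather than the Julia used below, just to keep the snippet self-contained):&lt;/p>

```python
import numpy as np

# Build a random symmetric positive definite matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)

def alpha(x):
    # the Rayleigh quotient of A at x
    return (x @ A @ x) / (x @ x)

# At an eigenvector, alpha recovers the corresponding eigenvalue
eigvals, eigvecs = np.linalg.eigh(A)
v = eigvecs[:, -1]  # eigenvector for the largest eigenvalue
print(np.isclose(alpha(v), eigvals[-1]))  # True
```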
&lt;p>Thus far, given an arbitrary vector, we&amp;rsquo;ve found a way to come up with an
eigenvalue-like scalar that corresponds to it. Intuitively, in order to find a bona
fide eigenvalue of $A$, we have to iteratively nudge our initial eigenvector estimate
toward eigenvector-hood. As our initial guess tends toward an eigenvector $v$, $\alpha(v)$
tends toward an eigenvalue of $A$.&lt;/p>
&lt;h2 id="power-iteration">Power iteration&lt;/h2>
&lt;p>Power iteration is not our destination but it is a conceptual building block that
we will spend a moment on here. Ultimately, it has some limitations, but its ideas
will help us later.&lt;/p>
&lt;h3 id="the-algorithm">The algorithm&lt;/h3>
&lt;p>The algorithm finds the largest eigenvalue and corresponding eigenvector of a matrix
$A$. To do this, it starts with an arbitrary vector $v_0$ and computes $v_i = Av_{i-1}$,
for $i = 1,\dots, m$ (normalizing each of the results). It then uses the estimate
$v_i$ to form the $i$th eigenvalue estimate $\lambda_i = \alpha(v_i) = v_i^TAv_i$
(no denominator because $\|v_i\| = 1$). As we will show momentarily, as $i \to \infty$,
$\lambda$ converges to the largest eigenvalue $\lambda_1$ of $A$ and $v$ converges to an
eigenvector $v_1$ corresponding to $\lambda_1$. Before we prove anything, here is code
that implements power iteration in the Julia programming language.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-julia" data-lang="julia">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">function&lt;/span> power_iteration(A, iters)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> m &lt;span style="color:#f92672">=&lt;/span> size(A, &lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> rand(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">/&lt;/span> norm(v, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#f92672">:&lt;/span>iters
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># update v&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> u &lt;span style="color:#f92672">=&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> u &lt;span style="color:#f92672">/&lt;/span> norm(u, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Rayleigh quotient&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> v&lt;span style="color:#f92672">&amp;#39;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> λ, v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In essence, what we&amp;rsquo;ve provided is a way of finding the largest eigenvalue
and its eigenvector beginning from a crude estimate of the eigenvector.&lt;/p>
&lt;h3 id="why-does-it-work">Why does it work?&lt;/h3>
&lt;p>To show that the sequences of iterates converge in the way that we claimed, we
just need to show that the sequence $v_i$ converges to an eigenvector of $A$
(because we&amp;rsquo;ve already shown that, given an eigenvector, the Rayleigh quotient produces the corresponding eigenvalue).&lt;/p>
&lt;p>Let&amp;rsquo;s say that $\{q_i\}$, $i = 1,\dots,m$, make up an orthogonal basis of eigenvectors of $A$
corresponding to the eigenvalues $\lambda_i$ (this set exists because $A$ is symmetric).
We will also assume that $|\lambda_1| > |\lambda_2| \geq \dots \geq |\lambda_m|$; the ordering costs us nothing, while the strict first inequality is a genuine assumption.
Because $v_k = c_kA^kv_0$ for some sequence of constants $c_k$ (because of the
normalization at each step), we can use the expansion of $v_k$ in the basis $\{q_i\}$ as
&lt;/p>
$$
\begin{align*}
v_k &amp;= c_kA^kv_0 \\\\
&amp;= c_kA^k(a_1q_1 + \dots + a_mq_m)\\\\
&amp;= c_k(a_1\lambda_1^kq_1 + \dots + a_m\lambda_m^kq_m)\\\\
&amp;= c_k\lambda_1^k(a_1q_1 + a_2(\lambda_2/\lambda_1)^kq_2 + \dots + a_m(\lambda_m/\lambda_1)^kq_m)
\end{align*}
$$&lt;p>
Because $\lambda_1$ is larger in magnitude than all the other eigenvalues, as $k \to \infty$, all
but the first of the terms in the parentheses in the last line go to zero, so we have
$v_k \to c_ka_1\lambda_1^kq_1$, which is a scalar multiple of $q_1$, the eigenvector
of $A$ corresponding to the largest eigenvalue. (We do not need to worry about
the sign of the constants; the important thing is that the one-dimensional
subspace spanned by the $v_k$ is the same as the one spanned by $q_1$.)&lt;/p>
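&lt;p>The convergence claim is easy to check numerically. Here is a short sketch in Python with NumPy (not part of the original Julia listing; the example matrix is an arbitrary symmetric one) that runs the same algorithm on a $2\times 2$ matrix whose largest eigenvalue is $(5 + \sqrt 5)/2 \approx 3.618$:&lt;/p>

```python
import numpy as np

def power_iteration(A, iters, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])
    v = v / np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        u = A @ v                 # update v
        v = u / np.linalg.norm(u)
        lam = v @ A @ v           # Rayleigh quotient
    return lam, v

# symmetric 2x2 matrix with eigenvalues (5 ± sqrt(5))/2
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_iteration(A, 100)
print(lam)                        # approximately 3.618, the larger eigenvalue
```

&lt;p>The error contracts by roughly the factor $|\lambda_2/\lambda_1|$ on each pass, which previews the limitation discussed next.&lt;/p>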
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;p>Unfortunately, power iteration is not really used in practice for a couple of reasons.
The first is that it only finds the eigenpair corresponding to the largest eigenvalue. The second
is that its rate of convergence depends on the ratio $|\lambda_2|/|\lambda_1|$: roughly speaking, the
error shrinks by that factor on each iteration. If $\lambda_1$ and $\lambda_2$ are close in magnitude,
the convergence is very slow. There is a modification we can make to mitigate some of
these issues, which we&amp;rsquo;ll discuss next.&lt;/p>
&lt;h2 id="inverse-iteration">Inverse iteration&lt;/h2>
&lt;p>Let&amp;rsquo;s see if we can find a better, more reliable way to find these eigenvectors. Suppose
that $\mu$ is a scalar that is &lt;em>not&lt;/em> an eigenvalue of $A$ and let $v$ be an eigenvector
of $A$ with associated eigenvalue $\lambda$. We can show that $v$ is also an eigenvector
of $(A - \mu I)^{-1}$ by
&lt;/p>
$$
\begin{align*}
(A - \mu I)v &amp;= Av - \mu v \\\\
&amp;= (\lambda - \mu) v
\end{align*}
$$&lt;p>
If we multiply on the left by $(A - \mu I)^{-1}$ and then divide on both sides by
$\lambda - \mu$, we have $(A - \mu I)^{-1}v = v / (\lambda - \mu)$. In other words,
$1/(\lambda - \mu)$ is an eigenvalue of $(A - \mu I)^{-1}$. (The invertibility of
$A - \mu I$ follows from the fact that $\lambda_i - \mu \neq 0$ for
each $i$.)&lt;/p>
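&lt;p>A quick numerical sanity check of this identity (a Python/NumPy sketch, not part of the original post; the example matrix and shift are arbitrary):&lt;/p>

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # symmetric example matrix
mu = 1.0                                  # a shift that is not an eigenvalue
lams = np.sort(np.linalg.eigvalsh(A))     # eigenvalues of A
B_inv = np.linalg.inv(A - mu * np.eye(2))
# eigenvalues of the shifted inverse should be 1/(lambda_i - mu)
got = np.sort(np.linalg.eigvalsh(B_inv))
want = np.sort(1.0 / (lams - mu))
print(np.allclose(got, want))             # True
```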
&lt;p>(You might be thinking: &amp;ldquo;What if $\mu$ is exactly equal to or very close to an
eigenvalue of $A$?&amp;rdquo; While we won&amp;rsquo;t go into detail here, it turns out that these
cases don&amp;rsquo;t really cause additional computational issues.)&lt;/p>
&lt;p>What&amp;rsquo;s nice about all this, though, is that if we take $\mu$ to be
a reasonable estimate of one of the eigenvalues $\lambda_i$ (more on this in a bit),
then we will have $(\lambda_i - \mu)^{-1}$ much larger than $(\lambda_j - \mu)^{-1}$
for $j \neq i$. We can thus conduct power iteration on $(A - \mu I)^{-1}$ and converge
very quickly to an eigenvector of $A$ &amp;ndash; essentially because we&amp;rsquo;ve magnified the
difference between one eigenvalue of $A - \mu I$ and the rest. Before we move on,
here is code (in Julia) that carries out inverse iteration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-julia" data-lang="julia">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">function&lt;/span> inverse_iteration(A, μ, iters)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> m &lt;span style="color:#f92672">=&lt;/span> size(A, &lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> rand(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">/&lt;/span> norm(v, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#75715e"># compute the shifted matrix once for reuse; we could also factor it once&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> B &lt;span style="color:#f92672">=&lt;/span> A &lt;span style="color:#f92672">-&lt;/span> μ &lt;span style="color:#f92672">*&lt;/span> I(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#f92672">:&lt;/span>iters
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># applying the inverse matrix is same as solving system&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> w &lt;span style="color:#f92672">=&lt;/span> B &lt;span style="color:#f92672">\&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># normalize&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> w &lt;span style="color:#f92672">/&lt;/span> norm(w, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>At this point, we have a way of turning vectors into reasonable eigenvalue estimates
(the Rayleigh quotient) and a reasonable way of turning eigenvalue estimates into
eigenvectors (inverse iteration). Can we combine these somehow? The answer is yes,
and this discussion is the final leg of our journey.&lt;/p>
&lt;h2 id="rayleigh-quotient-iteration">Rayleigh quotient iteration&lt;/h2>
&lt;p>We can put the two algorithms together by repeating two operations:&lt;/p>
&lt;ol>
&lt;li>Use an inverse iteration step to refine our estimate of the eigenvector using
the latest estimate of $\lambda$.&lt;/li>
&lt;li>Use the Rayleigh quotient to turn the refined eigenvector estimate into a refined
eigenvalue estimate.&lt;/li>
&lt;/ol>
&lt;p>As the eigenvalue estimate $\mu$ becomes better, the speed of convergence of inverse
iteration increases, so that this natural combination yields our best algorithm yet.
Without detailing the convergence proof, this algorithm converges &lt;em>extremely&lt;/em> quickly:
on every iteration, the number of digits of accuracy on the eigenvalue estimate &lt;em>triples&lt;/em>!&lt;/p>
&lt;p>Here is the code in Julia:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-julia" data-lang="julia">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">function&lt;/span> rayleigh_iteration(A, iters)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> m &lt;span style="color:#f92672">=&lt;/span> size(A, &lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> rand(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">/&lt;/span> norm(v, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> v&lt;span style="color:#f92672">&amp;#39;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#f92672">:&lt;/span>iters
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># inverse iteration step&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> w &lt;span style="color:#f92672">=&lt;/span> (A &lt;span style="color:#f92672">-&lt;/span> λ &lt;span style="color:#f92672">*&lt;/span> I(m)) &lt;span style="color:#f92672">\&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">.=&lt;/span> w &lt;span style="color:#f92672">./&lt;/span> norm(w, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Rayleigh quotient&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> v&lt;span style="color:#f92672">&amp;#39;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> λ, v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post, we went through a couple of different algorithms that help us find eigenvalues
and eigenvectors. While we typically first learn that eigenvalues and eigenvectors should be thought
about in the context of characteristic polynomials and determinants, it turns out that for both
theoretical (Abel&amp;rsquo;s impossibility theorem) and computational (the ill-conditioning of polynomial root-finding)
reasons, an iterative approach is actually &lt;em>required&lt;/em> for finding them in practice.&lt;/p>
&lt;p>In addition to wanting to cement my understanding of these algorithms as well as possible
before moving to the next lecture in the textbook, I thought this was a cool case of
different approaches combining their strengths to yield an algorithm more effective than
the individual parts.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>QR factorization</title><link>https://www.jgindi.me/posts/2020-12-27-qr/</link><pubDate>Sun, 27 Dec 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-12-27-qr/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This post was inspired by some conversations with my brother about concepts from linear algebra. I&amp;rsquo;m writing it mostly to better understand the idea myself, but hopefully some others will find it clear and useful too.&lt;/p>
&lt;p>Generally speaking, a matrix decomposition starts with a matrix $A$ and asks how we can decompose it into other matrices with convenient properties. These algorithms often make common matrix-related tasks, such as solving (potentially large) systems of linear equations, much more computationally efficient. You can find a long list of decompositions &lt;a href="https://en.wikipedia.org/wiki/Matrix_decomposition">here&lt;/a>, but in this post we&amp;rsquo;re going to talk about one particular decomposition (or factorization): the QR factorization.&lt;/p>
&lt;p>We&amp;rsquo;ll begin by showing why the decomposition comes in handy for solving linear equations. Once we&amp;rsquo;ve convinced ourselves that the decomposition is useful, we will then discuss how we go about finding the components that the decomposition finds for us.&lt;/p>
&lt;h2 id="what-the--factorization-does">What the $QR$-factorization does&lt;/h2>
&lt;p>Simply put, the $QR$ factorization algorithm takes as input a matrix A and outputs a pair of matrices, $Q$ and $R$, such that $Q$ is orthogonal and $R$ is upper triangular (which means that the elements of the matrix below the diagonal are all 0).&lt;/p>
&lt;p>Let&amp;rsquo;s suppose that we&amp;rsquo;re interested in solving a system of linear equations given compactly in matrix-vector notation by $Ax = b$, where $A \in \mathbf{R}^{n\times k}$, $x \in \mathbf{R}^k$, and $b \in \mathbf{R}^n$. We will suppose that $A$ is tall or square and that it has full column rank (its columns are an independent set). We will also assume that $b$ is in the range of $A$, so that a solution exists. The variable here is $x$; that is, we want to find $x_1, \dots, x_k$ that simultaneously satisfy all of the equations in the system.&lt;/p>
&lt;p>Now let&amp;rsquo;s magically decompose $A$ into the product of matrices $Q$ and $R$, where $Q$ is orthogonal and $R$ is upper triangular. Then we can rewrite the system we want to solve as $QRx = b$. We can solve the system in two steps:&lt;/p>
&lt;ol>
&lt;li>Solve the system $Qz = b$ by left multiplying both sides by $Q^T$, which gives $z = Q^Tb$ (a benefit of $Q$ being orthogonal: $Q^TQ = I$).&lt;/li>
&lt;li>Solve the system $Rx = z$. If you think about what it means for $R$ to be upper triangular, we can solve this part by first obtaining $x_k$ by
\begin{equation}
x_k = \frac{z_k}{R_{kk}},
\end{equation}
then obtaining $x_{k-1}$ by
\begin{equation}
x_{k-1} = \frac{z_{k-1}}{R_{k-1,k-1}} - z_k\frac{R_{k-1,k}}{R_{kk}R_{k-1,k-1}},
\end{equation}
then obtaining $x_{k-2}$, and so on, until we&amp;rsquo;ve computed all of the $x_i$.&lt;/li>
&lt;/ol>
&lt;p>This turns out to be much more efficient than solving the system in the naive way (e.g., forming an inverse or pseudoinverse of $A$ and left multiplying the original system by it).&lt;/p>
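&lt;p>To make the recipe above concrete, here is a Python/NumPy sketch (not part of the original post; it borrows NumPy&amp;rsquo;s built-in QR routine for the factorization and hand-rolls the back substitution):&lt;/p>

```python
import numpy as np

def solve_via_qr(Q, R, b):
    z = Q.T @ b                       # solve Qz = b using orthogonality of Q
    k = R.shape[1]
    x = np.zeros(k)
    for i in reversed(range(k)):      # back substitution on Rx = z
        x[i] = (z[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])  # tall, full column rank
x_true = np.array([1.0, -2.0])
b = A @ x_true                        # guarantees b is in the range of A
Q, R = np.linalg.qr(A)                # reduced factorization: Q is 3x2, R is 2x2
print(solve_via_qr(Q, R, b))          # recovers [1., -2.]
```

&lt;p>Once $Q$ and $R$ are in hand, each additional right hand side costs only a matrix-vector product and a triangular solve.&lt;/p>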
&lt;p>The other great advantage of computing a factorization like the $QR$ factorization is that it can be cached and reused! Suppose that instead of a single system, you want to solve 1000 systems, each with a different right hand side (e.g., $Ax = b_1$, $Ax = b_2$, &amp;hellip;, $Ax = b_{1000}$). After computing the factorization once, you can reuse it to make all 1000 solves more efficient! In some sense, you can amortize the cost of the factorization over a bunch of reuses. In real applications that I&amp;rsquo;ve been a part of developing, using matrix decompositions in this way has resulted in noticeable and impactful speedups.&lt;/p>
&lt;h2 id="finding--and">Finding $Q$ and $R$&lt;/h2>
&lt;p>So how do we find $Q$ and $R$?&lt;/p>
&lt;p>There are a few ways to compute $Q$ and $R$, but in this post I want to walk through the most intuitive of them: the Gram-Schmidt algorithm (GS).&lt;/p>
&lt;p>We will describe GS when the input is a linearly independent set of $n$-vectors $a_1,\dots,a_k$ (this implies $k \leq n$). The general idea of the algorithm is that at the $i$th step we construct a vector $q_i$ using $a_i$ as a starting point and removing from it everything that $a_i$ shares with the vectors we&amp;rsquo;ve already computed in prior steps, i.e., $q_1$ through $q_{i-1}$. By construction, $q_i$ doesn&amp;rsquo;t have anything in common with any of the vectors computed before it, so the collection $\{q_1,\dots,q_k\}$ are orthogonal. If we divide each vector by its length at each step, the orthogonal collection becomes orthonormal. The vectors $q_i$ become the columns of $Q$.&lt;/p>
&lt;p>In order to compute $R$, we need to make the idea of &amp;ldquo;removing everything that $a_i$ shares with the vectors we&amp;rsquo;ve already computed in prior steps&amp;rdquo; more precise. Suppose we are part of the way through the algorithm. As we&amp;rsquo;re preparing for the $i$th step, we have the vector $a_i$ that we need to incorporate into our output and the orthonormal collection $q_1,\dots,q_{i-1}$ that we&amp;rsquo;ve built up so far. For some $1 \leq k \leq i-1$, let $v_k = (q_k^Ta_i)q_k$. If we subtract $v_k$ from $a_i$, let&amp;rsquo;s see what the result &amp;ldquo;has in common&amp;rdquo; with $q_k$ by taking the inner product:&lt;/p>
$$
\begin{align*}
(a_i - v_k)^Tq_k &amp;= (a_i - (q_k^Ta_i)q_k)^T q_k \\\\
&amp;= a_i^Tq_k - q_k^Ta_i \\\\
&amp;= 0
\end{align*}
$$&lt;p>This means $a_i - v_k$ has nothing in common with, or in math parlance, is orthogonal to, $q_k$! Thus, to make $q_i$ orthogonal to all of the $q_k$, we just take $q_i$ proportional to $a_i - v_1 - v_2 - \dots - v_{i-1}$ (normalizing below). The GS algorithm can thus be stated compactly:&lt;/p>
&lt;p>For $i = 1,\dots,k$, let $p_i = a_i - v_1 - \dots - v_{i-1}$. Then define $q_i = p_i / ||p_i||$. When you&amp;rsquo;ve cycled through all values of $i$, return the collection $q_1,\dots,q_k$.&lt;/p>
&lt;p>With this in hand, we can now define the entries of $R$. If $p_i = a_i - v_1 - \dots - v_{i-1}$, then we can isolate $a_i$ and obtain&lt;/p>
$$
\begin{align*}
a_i &amp;= \|p_i\|q_i + v_1 + \dots + v_{i-1} \\\\
&amp;= \|p_i\|q_i + (q_1^Ta_i)q_1 + \dots + (q_{i-1}^Ta_i)q_{i-1}.
\end{align*}
$$&lt;p>We will choose $R_{ij} = q_i^Ta_j$ for $i&lt;j$, $R_{ii} = \|p_i\|$ for $i=j$, and $R_{ij} = 0$ for $i>j$; in essence, we&amp;rsquo;re just picking the entries of the $j$th column of $R$ right out of the expression for $a_j$ in terms of the $q_i$.&lt;/p>
&lt;p>By defining $Q$ and $R$ as such, we have $A = QR$, as we wanted.&lt;/p>
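&lt;p>The procedure above translates almost line for line into code. Here is a Python/NumPy sketch (not from the original post; the function name is mine) that builds $Q$ and $R$ exactly as described and checks that $A = QR$:&lt;/p>

```python
import numpy as np

def gram_schmidt_qr(A):
    n, k = A.shape
    Q = np.zeros((n, k))
    R = np.zeros((k, k))
    for i in range(k):
        p = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]   # what a_i shares with q_j
            p = p - R[j, i] * Q[:, j]     # remove it
        R[i, i] = np.linalg.norm(p)       # diagonal entry is the length of p_i
        Q[:, i] = p / R[i, i]
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # independent columns
Q, R = gram_schmidt_qr(A)
print(np.allclose(A, Q @ R))              # True
print(np.allclose(Q.T @ Q, np.eye(2)))    # True: columns of Q are orthonormal
```

&lt;p>(One caveat worth knowing: this &amp;ldquo;classical&amp;rdquo; Gram-Schmidt is numerically delicate in floating point; the &amp;ldquo;modified&amp;rdquo; variant or Householder reflections are preferred in practice.)&lt;/p>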
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The $QR$ factorization is very useful without being overly abstract. Ultimately, the insight that makes it possible is very intuitive (even though many symbols were harmed in the writing of this post). The method described above is by no means the only way to compute the $QR$ factorization. I may go through some of the others in future posts.&lt;/p>
&lt;p>An algorithm like the one we considered in this post is one of the most satisfying things about working with and studying mathematics; one moment, you&amp;rsquo;re thinking about linear independence and orthogonality, and the next you&amp;rsquo;ve got a very useful, practical algorithm.&lt;/p>
&lt;p>(Note: The exposition of the algorithm in this post is inspired by that of &lt;a href="http://vmls-book.stanford.edu">this book&lt;/a> by Boyd and Vandenberghe.)&lt;/p></description></item><item><title>Anniversary math</title><link>https://www.jgindi.me/posts/2020-11-27-anniversary-math/</link><pubDate>Fri, 27 Nov 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-11-27-anniversary-math/</guid><description>&lt;p>My wife and I got married one year ago today, on 11/27/2019. In honor of this very special day, I wanted to write a special short post showing that, in some sense, we&amp;rsquo;ve actually been married longer than one year.&lt;/p>
&lt;p>There are 365 days in a year (not including leap years). Of those 365, roughly 260 are weekdays. If not for the pandemic, we probably would have spent 3-4 waking hours together per weekday (0 in the morning and 3-4 after we got home from work/school). Of the remaining 105 non-weekdays, we might have spent 9-10 waking hours together. For one year of marriage, without considering small exceptions here and there, we&amp;rsquo;d thus expect roughly 10 * 105 + 4 * 260 = 2090 waking hours, or 87 waking days, spent together.&lt;/p>
&lt;p>Because of COVID-19, we&amp;rsquo;ve spent roughly 3/4 of our marriage quarantined in lockdown. Instead of 3-4 waking hours together on weekdays, we&amp;rsquo;ve been spending roughly 12-14 waking hours together per weekday during these unprecedented times. Assuming that we also spent a bit more time together on weekends, say 4 additional hours per day (totaling 13-14 hours), one year of marriage has produced approximately 1/4 * 2090 + 3/4 * (365 * 14) = 4355 waking hours (181 waking days) spent together. While the math is admittedly not entirely rigorous, this past year seems to have actually equated to over 2 years of waking marriage!&lt;/p>
&lt;p>This calculation obviously does not take &lt;em>everything&lt;/em> into account, but when I think about it, I realize how grateful I am for having spent all this time with my wife over the past year and look forward to many more happy, healthy years ahead. To Alexandra, happy 1 (2?) year anniversary; to everyone else, happy Thanksgiving!&lt;/p></description></item><item><title>Fibonacci with linear algebra</title><link>https://www.jgindi.me/posts/2020-10-27-fib-lin-alg/</link><pubDate>Tue, 27 Oct 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-10-27-fib-lin-alg/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>After writing a post about one interesting way to arrive at a closed-form for the $n$th term of the Fibonacci
sequence, a friend pointed out a few alternative ways to get there, one of which felt particularly natural.
It requires some linear algebra, so I guess that in some sense, it could be considered &amp;ldquo;unnatural,&amp;rdquo; but I
especially like it because the argument&amp;rsquo;s flow requires fewer arbitrary-seeming leaps. With that said, this
post will be brief (as I&amp;rsquo;m a little busy at the moment), so if there&amp;rsquo;s anything that doesn&amp;rsquo;t make sense or is
incorrect, please do reach out and let me know.&lt;/p>
&lt;h2 id="fibonacci-matrix">Fibonacci matrix&lt;/h2>
&lt;p>The Fibonacci sequence can be viewed through the lens of matrices. In particular, if we start with the matrix
&lt;/p>
$$
A = \begin{bmatrix} 1 &amp; 1 \\ 1 &amp; 0 \end{bmatrix},
$$&lt;p> we can see that the Fibonacci sequence can be materialized by repeatedly multiplying $A$ by itself.&lt;/p>
&lt;p>To see this, first notice that if we are considering the Fibonacci sequence that starts with $F_0 = 0$ and
$F_1 = 1$, then we note that
&lt;/p>
$$
A = \begin{bmatrix} F_2 &amp; F_1 \\ F_1 &amp; F_0 \end{bmatrix}.
$$&lt;p>
Suppose now that
&lt;/p>
$$
A^{n-1} = \begin{bmatrix} 1 &amp; 1 \\ 1 &amp; 0 \end{bmatrix}^{n-1} = \begin{bmatrix} F_n &amp; F_{n-1} \\ F_{n-1} &amp;
F_{n-2} \end{bmatrix}
$$&lt;p> for $n \geq 2$. Then
&lt;/p>
$$
A^n = A^{n-1}A = \begin{bmatrix} F_n + F_{n-1} &amp; F_n \\ F_{n-1} + F_{n-2} &amp; F_{n-1}
\end{bmatrix} = \begin{bmatrix} F_{n+1} &amp; F_n \\ F_n &amp; F_{n-1} \end{bmatrix}.
$$&lt;p>
Thus, to get the $n$th element of the Fibonacci sequence, we need only read off the top right (or bottom left) entry of $A^n$. If
we had a fast way to obtain $A^n$ instead of actually carrying out iterated matrix multiplication, we could
obtain the $n$th element without doing very much work.&lt;/p>
&lt;h2 id="diagonalizing">Diagonalizing $A$&lt;/h2>
&lt;p>To do this, we will look to diagonalize $A$. This means that we will try to write
&lt;/p>
$$
A = VDV^{-1}
$$&lt;p> where $D$ is a diagonal matrix and $V$ is a matrix with eigenvectors of $A$ as columns (we have to
actually find $D$ and $V$ that work). What&amp;rsquo;s nice about a diagonal representation is that
&lt;/p>
$$
A^n = (VDV^{-1})^n = VDV^{-1}VDV^{-1}\dots VDV^{-1} = VD^nV^{-1}.
$$&lt;p>
If $D$ is written as a matrix with $\lambda_1,\dots,\lambda_n$ on the diagonal, then $D^n$ is simply $D$ with
the elements on the diagonal each taken to the $n$th power, like so:
&lt;/p>
$$
D = \begin{bmatrix} \lambda_1 &amp; 0 \\ 0 &amp; \lambda_2 \end{bmatrix} \implies D^n = \begin{bmatrix} \lambda_1^n &amp;
0 \\ 0 &amp; \lambda_2^n \end{bmatrix}.
$$&lt;p>Now that we&amp;rsquo;ve laid out our approach, it&amp;rsquo;s time to carry it out. (I will spare you the algebra in order to keep this post brief.)&lt;/p>
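&lt;p>Before doing the algebra by hand, we can sanity-check the plan numerically. The sketch below (Python/NumPy, not part of the original post) diagonalizes the Fibonacci matrix and confirms that $VD^{10}V^{-1}$ matches $A^{10}$ computed by repeated multiplication:&lt;/p>

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 0.0]])    # the Fibonacci matrix
eigvals, V = np.linalg.eig(A)             # columns of V are eigenvectors of A

n = 10
An_diag = V @ np.diag(eigvals ** n) @ np.linalg.inv(V)
An_direct = np.linalg.matrix_power(A.astype(int), n)
print(np.round(An_diag).astype(int))      # both prints show [[89 55] [55 34]]
print(An_direct)                          # F_10 = 55 sits off the diagonal
```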
&lt;h2 id="using-s-eigenvalues">Using $A$&amp;rsquo;s eigenvalues&lt;/h2>
&lt;p>The eigenvalues of $A$ (as you might expect if you read the earlier post about Fibonacci) are
$\lambda_1,\lambda_2 = \frac{1 \pm \sqrt 5}{2}$, and the corresponding eigenvectors are $v_1, v_2 =
\begin{bmatrix} \frac{1 \pm \sqrt 5}{2} \\ 1 \end{bmatrix}$. Thus, we have
&lt;/p>
$$
D = \begin{bmatrix} \frac{1 + \sqrt 5}{2} &amp; 0 \\ 0 &amp; \frac{1 - \sqrt 5}{2} \end{bmatrix} ~~~~ V =
\begin{bmatrix} \frac{1 + \sqrt 5}{2} &amp; \frac{1 - \sqrt 5}{2} \\ 1 &amp; 1\end{bmatrix}.
$$&lt;p>
With these in hand, we see that $A$ is indeed diagonalizable, so that $A^n$ can be written as $VD^nV^{-1}$,
and voila! with two matrix multiplications and exponentiating the diagonal entries of $D$, we can very
quickly and efficiently come up with large Fibonacci numbers.&lt;/p></description></item><item><title>Fibonacci with difference equations</title><link>https://www.jgindi.me/posts/2019-10-13-fib-diff-eq/</link><pubDate>Tue, 13 Oct 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2019-10-13-fib-diff-eq/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The Fibonacci sequence is a mathematical object that has garnered much fascination over the centuries due to its simplicity and knack for rearing its head in all sorts of unexpected places. In this post I want to lead you to a different way of thinking about the sequence that is wholly unintuitive. You can only really arrive at it by using algebra in conjunction with some lucky guesses, as we’ll see. But before we dive into all that, let’s start with a bit of background.&lt;/p>
&lt;p>At the beginning of the 13th century, Fibonacci first wrote about the eponymous sequence while pondering a sensible way to model the evolution of a rabbit population over time. As we will discuss below, the definition of the sequence is delightfully simple, yet it seems to show up in a bunch of different places, including:&lt;/p>
&lt;ul>
&lt;li>The way that branches of trees propagate in nature.&lt;/li>
&lt;li>Numbers of flower petals on certain kinds of flowers.&lt;/li>
&lt;li>The number of possible ancestors on a human $X$-chromosome.&lt;/li>
&lt;li>Any place you&amp;rsquo;ve heard about the golden ratio $\Phi$.&lt;/li>
&lt;/ul>
&lt;p>It is an object that is a delicious blend of simple and deep. So simple to play around with, in fact, that the reason I thought to write this post was actually the result of an interview that took an interesting (and fun!) turn.&lt;/p>
&lt;p>One of my colleagues asked a candidate with some technical expertise to write some code that would do something with Fibonacci numbers. In response, the candidate cited this crazy-looking formula to my colleague without justification. After the interview, my colleague came back to our bay of desks interested as to whether I had ever seen the formula. When I said that I had, he asked if I knew how to prove that it was correct. While I didn&amp;rsquo;t know how to on the spot, I vaguely recalled some concepts from a class I took during undergrad that I thought might help. Together with some googling, I put together what we will work through below. Looking at the work I had done once I finished, I decided that what we’re about to embark on is a cool example of the type of visibility math can give us into things we might not be able to see otherwise. Let’s go!&lt;/p>
&lt;h2 id="mathematical-background">Mathematical background&lt;/h2>
&lt;p>A sequence is just a list of finite, or even infinite, length. As examples, $1, 2, 3, 4, 5$ is a sequence, as are $2, 4, 6, 8, 10, \dots$ (the even numbers) and $2, 3, 5, 7, 11, \dots$ (the prime numbers). Each of these sequences has a first element, a second element, a third element, etc., so when we want to prove general assertions about sequences without explicitly writing down their elements, we denote the elements by $a_1$ (the first element), $a_2$ (the second element), $a_3$ (the third element), …. For the sequence $1, 2, 3, 4, 5$, we would say that $a_1 = 1$, $a_2 = 2$ and so on. Another important characteristic that usually comes along with sequences is some kind of rule that tells you how to come up with a given element. That is, many sequences come with a formula, in terms of $i$, that can produce the value of $a_i$. In the case of the even numbers, that formula is $a_i = 2i$ (to compute the $i$th element, calculate $2 \times i$). To get the 3rd even number, compute $2 \times 3 = 6$. To get the 37th, compute $2 \times 37 = 74$.&lt;/p>
&lt;p>The Fibonacci sequence is a recursive sequence. This means that it has a special kind of rule whereby the next term is defined in terms of the previous terms. The rule that defines the Fibonacci sequence is:
&lt;/p>
$$F_n = F_{n-1} + F_{n-2}.$$&lt;p>
In English, this says that the $n$th element of the Fibonacci sequence is the sum of the two previous elements; the 100th element is the sum of the 99th (which is the sum of the 98th and 97th, and so on) and the 98th (which is the sum of the 97th and 96th, etc.). Each of the two previous elements is the sum of the two elements before them, and so on and so forth. But, you might say, this can’t go on forever, there has to be some bottom at which the recursion ends, right? There is indeed! In order to define a recursive sequence, you need the first few elements (usually one or two) and a way of using previous elements to create new ones. In our case, we will set $F_0 = 0$ and $F_1 = 1$.&lt;/p>
&lt;p>You’ll notice that in order to calculate the $n$th element of this sequence, you have to do a good deal more work than you do to calculate the $n$th even number. To get the $n$th Fibonacci number, you have to traverse the chain all the way back to the beginning for each element you want to calculate, while the even numbers have what is known as a closed form, i.e. the aforementioned formula in terms of $n$ that you can just plug $n$ into to get the desired element. The question we’ll tackle for the rest of the post is: Does the Fibonacci sequence have a closed form? In other words, is there a formula, in terms of $n$, that I can just plug into to get the $n$th element of the Fibonacci sequence?&lt;/p>
&lt;h2 id="using-difference-equations">Using difference equations&lt;/h2>
&lt;p>First, let’s restate the problem. We want to find some formula for $F_n$ that satisfies $F_n = F_{n-1} + F_{n-2}$, or $F_n - F_{n-1} - F_{n-2} = 0$. Our first quantum leap is to guess that $F_n = m^n$ for some real number $m$ that we’re going to calculate. By guessing that $m^n$ is a solution, we are saying that $m^n - m^{n-1} - m^{n-2} = 0$. It’s easy to see that $m = 0$ satisfies this equation, but that’s boring, so let’s look for an $m \neq 0$.&lt;/p>
&lt;p>Because we are now considering only nonzero values of $m$, we can divide $m^n - m^{n-1} - m^{n-2} = 0$ by $m^{n-2}$ on both sides so that we are now looking at values of $m$ that satisfy
&lt;/p>
$$m^2 - m - 1 = 0$$&lt;p>
(this is called the characteristic equation).&lt;/p>
&lt;p>Using the quadratic formula you probably learned about sometime during middle school, we find two values of $m$ that work: the first is $\frac{1 + \sqrt{5}}{2}$ and the second is $\frac{1 - \sqrt{5}}{2}$. We will refer to these two values as $m_1$ and $m_2$ respectively. Recall that quadratic equations don’t always have real roots - sometimes they’re complex (i.e. they&amp;rsquo;re of the form $a + bi$ where $a,b$ are real numbers and $i$ is $\sqrt{-1}$)! For our second and last quantum leap, we’re going to take for granted the fact that when you’re solving what’s called a (deep breath now) second order homogeneous difference equation with constant coefficients (which $F_n - F_{n-1} - F_{n-2} = 0$ is), if the characteristic equation has distinct real roots (in our case, if $m_1$ and $m_2$ are real and different from one another) then the solution to our difference equation (our formula for $F_n$) has the form
&lt;/p>
$$
F_n = Am_1^n + Bm_2^n
$$&lt;p>
where $A$ and $B$ are constants. We already found $m_1$ and $m_2$, so we just need to find $A$ and $B$. For this, we use the fact that we start the Fibonacci sequence at 0 and then 1, i.e. $F_0 = 0$ and $F_1 = 1$. Let’s use these to finish the problem. Using $F_0 = 0$, we have
&lt;/p>
$$
F_0 = 0 = Am_1^0 + Bm_2^0 = A + B
$$&lt;p>
so that $B = -A$. Then we can use $F_1 = 1$ to see that
&lt;/p>
$$
F_1 = 1 = Am_1^1 + Bm_2^1 = A \frac{1 + \sqrt{5}}{2} + B \frac{1 - \sqrt{5}}{2}.
$$&lt;p>
Because $B = -A$, this is the same as
&lt;/p>
$$
1 = A \frac{1 + \sqrt{5}}{2} - A \frac{1 - \sqrt{5}}{2} = A\biggl(\frac{1 + \sqrt{5}}{2} - \frac{1 - \sqrt{5}}{2}\biggr) = A\sqrt{5}
$$&lt;p>
so $A = \frac{1}{\sqrt{5}}$ and $B = -A = -\frac{1}{\sqrt{5}}$. Thus,
&lt;/p>
$$
F_n = \frac{1}{\sqrt{5}}\biggl(\frac{1 + \sqrt{5}}{2}\biggr)^n - \frac{1}{\sqrt{5}}\biggl(\frac{1 - \sqrt{5}}{2}\biggr)^n.
$$&lt;p>
Pulling out the $\frac{1}{\sqrt{5}}$, we have
&lt;/p>
$$
F_n = \frac{1}{\sqrt{5}}\biggl(\biggl(\frac{1 + \sqrt{5}}{2}\biggr)^n - \biggl(\frac{1 - \sqrt{5}}{2}\biggr)^n\biggr).
$$&lt;h2 id="implementation">Implementation&lt;/h2>
&lt;p>Just to drive home why I think this is cool: we started out with a sequence defined by adding integers to each other in a pretty simple way; then, using some techniques that feel kind of heavy and opaque, we waved a magic wand, $\sqrt{5}$ showed up, and we produced a totally alien and unfamiliar, yet correct, representation of the elements of the Fibonacci sequence. To check our work, I wrote a short computer program that computes the first 10 Fibonacci numbers:&lt;/p>
&lt;p>(Note: the lines that start with # are not actually code, they are comments to help guide the reader.)&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">nth_fibonacci_number&lt;/span>(n):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Given a number n as input, this function computes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># the nth Fibonacci number&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Reflects our assertion that F_0 = 0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> n &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Reflects our assertion that F_1 = 1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> n &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># For any number n greater than or equal to 2, use the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># formula we came up with. The symbol * is multiplication,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># / is division, ** is exponentiation and math.sqrt takes a&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># square root. The below is one line, the &amp;#34;\&amp;#34; character is&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># just a python technicality.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> nth_fib &lt;span style="color:#f92672">=&lt;/span> (&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">/&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(&lt;span style="color:#ae81ff">5&lt;/span>)) &lt;span style="color:#f92672">*&lt;/span> ( ((&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">+&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(&lt;span style="color:#ae81ff">5&lt;/span>)) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#f92672">**&lt;/span> n \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">-&lt;/span> ((&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">-&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(&lt;span style="color:#ae81ff">5&lt;/span>)) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#f92672">**&lt;/span> n)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Return the result&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> nth_fib
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When I ran this code with $n = 0, 1, 2, 3, …, 9$, I got&lt;/p>
$$
\begin{align*}
F_0 &amp;= 0\\
F_1 &amp;= 1\\
F_2 &amp;= 1\\
F_3 &amp;= 2\\
F_4 &amp;= 3\\
F_5 &amp;= 5\\
F_6 &amp;= 8\\
F_7 &amp;= 13\\
F_8 &amp;= 21\\
F_9 &amp;= 34
\end{align*}
$$&lt;p>which is what you&amp;rsquo;d get if you went and calculated the first 10 Fibonacci numbers the usual way.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In my opinion, what we did here is far less important than how we did it. We started with an object that appeared simple. We then pulled techniques out of the mathematical netherworld to twist it into something entirely unrecognizable &amp;ndash; even, dare I say, scary. Yet somehow, once we were done with the trickery and misdirection, we found something in our hand that… well… just worked. Why did it work? Why should it even exist? Because someone curious enough reached into the abstract and found it. It’s as simple as that.&lt;/p></description></item><item><title>An introduction to convex optimization</title><link>https://www.jgindi.me/posts/2020-09-04-cvx-opt/</link><pubDate>Fri, 04 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-09-04-cvx-opt/</guid><description>&lt;h2 id="the-general-idea">The General Idea&lt;/h2>
&lt;p>Over this past summer (2020), I took Stanford&amp;rsquo;s &lt;a href="https://web.stanford.edu/class/ee364a/">EE364A&lt;/a> course, which is about a subdiscipline of mathematical optimization called convex optimization. I learned that it has myriad applications all over engineering, finance, medicine, signal processing, and many other seemingly disconnected fields. In this post, I want to discuss what convex optimization is and what makes it so useful as a problem solving technique.&lt;/p>
&lt;p>In as general a sense as possible, we humans solve optimization problems all the time. We find ourselves in situations where we have to make decisions about which choice will lead to the best outcome (or the least bad outcome in some unfortunate cases), but most of the time, the space of possible decisions is constrained. When we find ourselves saying things like &amp;ldquo;If the stars aligned, I would&amp;hellip;,&amp;rdquo; or, &amp;ldquo;In a perfect world, I&amp;rsquo;d choose&amp;hellip;,&amp;rdquo; we&amp;rsquo;re usually lamenting the fact that the decision we&amp;rsquo;re facing would be easier to make if our options were not constrained.&lt;/p>
&lt;p>The field of mathematical optimization attempts to bring mathematical formalism to the above-described decision making process. A mathematical optimization problem has a few components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Decision variable:&lt;/strong> This represents the space of decisions you have to make. Solving an optimization problem involves finding the &amp;ldquo;best&amp;rdquo; value of a decision variable, where best is defined by your choice of&amp;hellip;&lt;/li>
&lt;li>&lt;strong>Cost/Profit (a.k.a. objective) function:&lt;/strong> This represents some negative/positive utility of a decision.&lt;/li>
&lt;li>&lt;strong>Constraints:&lt;/strong> These constrain the space of decisions you can make. The constraints imply a &lt;em>feasible&lt;/em> set, which is a set of allowed decisions.&lt;/li>
&lt;/ul>
&lt;p>Solving an optimization problem with these components amounts to finding a feasible value of the decision variable that minimizes (or maximizes) the cost (or profit) function. We will see examples of this below.&lt;/p>
&lt;p>While mathematicians have managed to develop a deep, rich mathematical theory around how to reason about and solve different kinds of mathematical optimization problems, unfortunately, for many (indeed, most) optimization problems, finding optimal solutions is computationally very difficult. For problems of significant enough size, this effectively renders them intractable.&lt;/p>
&lt;p>There is, however, a certain interesting class of optimization problems that can be solved very efficiently: these are called convex optimization problems. What makes a convex optimization problem easier to work with is the &amp;ldquo;shape&amp;rdquo; of its objective function and its feasible set. In order for an optimization problem to be convex, the objective function and feasible set must be convex.&lt;/p>
&lt;h2 id="convex-sets-and-convex-functions">Convex sets and convex functions&lt;/h2>
&lt;p>In an effort to not use any math symbols in this post, I&amp;rsquo;m going to resort to pictures to describe what convexity means and looks like for sets and functions. A set is convex if anytime you pick two points in a set, the line between those points is also inside the set. A picture from Chapter 2 of Convex Optimization by Boyd and Vandenberghe illustrates the idea:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/cvx-opt/convex.png" alt="">&lt;/p>
&lt;p>In the left set (all the points on the inside of the hexagon), pick any two points you want. Now draw a line segment between them. Notice that no matter what pair of points you pick, all points along the line segment between them will lie inside the hexagon! In the right set, on the other hand, we see a pair of points (there are actually many, can you find another pair?) for which the line segment between them is not fully contained within the set. The set on the right is thus not convex.&lt;/p>
&lt;p>Loosely speaking, a function is convex if it has upward curvature. The simplest example that gets the point across is to think of a smile. In three dimensions, a simple example is a bowl. In more mathematical parlance, upward curvature means that any tangent line (or, in higher dimensions, plane) that you draw through any point on the function lies below the function itself. In the picture below, the red and green lines are tangent to the black curve. Notice that the red, green, and any other tangent line you might draw lie or would lie beneath the black curve.&lt;/p>
&lt;p>That&amp;rsquo;s it! Those properties and some clever reasoning are enough to ensure that if a problem is convex, you can probably solve it efficiently with the right software. As long as the space of allowed decisions is a convex set and the objective function is convex, you&amp;rsquo;re off to the races!&lt;/p>
&lt;h2 id="examples">Examples&lt;/h2>
&lt;p>Before closing this post, I want to describe two examples of how you might translate an industry problem into a convex optimization problem.&lt;/p>
&lt;p>The first has to do with radiation treatment planning for cancer patients. The goal is to figure out how to schedule a radiation treatment plan (over some period of time) that trades off damage to a patient&amp;rsquo;s health with shrinking the size of a tumor. For this problem, we set it up as follows:&lt;/p>
&lt;ul>
&lt;li>The decision we have to make is how much treatment to administer in each time period.&lt;/li>
&lt;li>The goal is to minimize the maximum damage to the patient over the entire course of the treatment.&lt;/li>
&lt;li>We are constrained by:
&lt;ul>
&lt;li>A maximum allowed dose in each period.&lt;/li>
&lt;li>Wanting the tumor to be below some maximum tumor size.&lt;/li>
&lt;li>The way the patient&amp;rsquo;s health changes with treatment over time.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>With a few mathematical tricks, this problem can be formulated as a convex problem and solved very efficiently. As you might imagine, balancing patient health and tumor size is critical for the health and longevity of patients. I think this is a great example of one case in which convex optimization allows us to thread that needle very precisely. (Note: This problem was actually a problem on the final exam in the course I took. It was derived from real research by the professor.)&lt;/p>
&lt;p>Another example of the effectiveness of convex optimization comes from a very different application: finance. In this example, we want to choose a portfolio that minimizes risk while achieving a particular expected return. In its simplest form, this is a convex optimization problem that can be formulated as follows:&lt;/p>
&lt;ul>
&lt;li>The decision we have to make is to pick a portfolio (from a universe of a bunch of stocks).&lt;/li>
&lt;li>The cost of the portfolio is the amount of risk the portfolio holds. We want to minimize this quantity.&lt;/li>
&lt;li>We are constrained by the expected return that we want to achieve. Any portfolio we pick has to achieve a particular expected return.&lt;/li>
&lt;/ul>
&lt;p>We can add other constraints (e.g., no short-selling) and add terms to the cost function (e.g., tax liability, transaction costs, which we would also want to minimize), but the optimization problem we just formulated, in some form or other, is at the core of most of the portfolio construction being done in industry today.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Examples of convex optimization&amp;rsquo;s uncanny effectiveness and ubiquity are everywhere, but there&amp;rsquo;s an important point I want to stress before we close. In each of the examples above, the accuracy and utility of the output depends on very human choices about the problem setup. In the treatment planning example, it depends on the models the mathematicians and doctors come up with for how patient health changes and how tumor size changes. In the portfolio construction example, it depends on how good our projections of expected returns are. Some of what I showed and talked about in this post is heavily mathematical, but modeling these problems well is truly an art. So while the math is important and the engine that makes all of this possible, none of it really works without consistent communication and collaboration with domain experts.&lt;/p>
&lt;p>Finally, I want to thank &lt;a href="https://web.stanford.edu/~boyd/">Stephen Boyd&lt;/a>, &lt;a href="https://web.stanford.edu/~anqif/">Anqi Fu&lt;/a>, and the rest of the EE364A staff for a fantastic course. I really cannot overstate how excellent the class was. If you&amp;rsquo;re interested in seeing what some of this is about in more detail, a free version of the class is available &lt;a href="https://www.youtube.com/playlist?list=PL3940DD956CDF0622">on YouTube&lt;/a> and the lecture slides and course textbook are available for free &lt;a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">here&lt;/a>.&lt;/p>
&lt;p>I did it! No math symbols!&lt;/p></description></item><item><title>How many infinities are there?</title><link>https://www.jgindi.me/posts/2020-04-05-cantors-thm/</link><pubDate>Sun, 05 Apr 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-04-05-cantors-thm/</guid><description>&lt;p>(This post assumes you&amp;rsquo;ve read, at least, &lt;a href="https://www.quora.com/q/iqjtzyaumcdpirqz/Big-and-bigger-Part-1-one-to-one-correspondences">this&lt;/a> and &lt;a href="https://www.blogger.com/blog/post/edit/754026028311034636/3224454722535965349#">this&lt;/a>.)&lt;/p>
&lt;p>All of the posts on infinity that I&amp;rsquo;ve written to this point have pointed to two different infinite sizes, or cardinalities. The first is the countable kind, the kind we associate with the natural numbers $\mathbf{N}$. The other is the kind we associate with the real numbers, $\mathbf{R}$; we call $\mathbf{R}$ an uncountable set.&lt;/p>
&lt;p>Before we proved that $\mathbf{R}$ has a different size than $\mathbf{N}$, we made a convincing intuitive
case that there is really only one kind of infinity. That infinities come in at least two varieties was one
of many counterintuitive results Cantor contributed to the foundation of rigorous transfinite
mathematics. But there is yet another question we have still not answered: Are there more than two
transfinite cardinalities, or do all infinite sets have either $|\mathbf{N}|$ or $|\mathbf{R}|$ elements?&lt;/p>
&lt;p>This post concerns a result that Cantor proved in the same 1891 paper in which his diagonalization argument for the uncountability of the reals appears. It is called Cantor&amp;rsquo;s Theorem, and it shows that there are actually infinitely many infinities. What does that mean? How is that possible? Let&amp;rsquo;s dive in and see.&lt;/p>
&lt;p>Before we get into it, we establish a small amount of notation. If $a$ is an element of $A$, we write $a \in
A$. The size of the set $A$, or $A$&amp;rsquo;s cardinality, is denoted $|A|$. The power set of a set $A$, denoted
$P(A)$, is the set of all subsets of $A$. For example, the power set of $A = \{1, 2\}$ is $P(A) =
\{\emptyset,\{1\}, \{2\}, \{1, 2\}\}$, where $\emptyset$ is the empty set. There is actually a
straightforward argument that shows that if $|A| = n$, then $|P(A)| = 2^n$.
It goes like this. Each element $a$ in $A$ is either in or not in each subset of $A$. Thus, constructing a
subset requires $n$ choices, each of which is between two options (&amp;ldquo;in&amp;rdquo; or &amp;ldquo;not in&amp;rdquo;); the number of such
subsets is thus $2 \times \dots \times 2 = 2^n$.&lt;/p>
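&lt;p>That counting argument is easy to check directly. Here is a short sketch (my own, not from the post) that enumerates the power set of a small example using Python&amp;rsquo;s itertools and confirms $|P(A)| = 2^n$:&lt;/p>

```python
from itertools import combinations

def power_set(A):
    # Build every subset of A: for each size k from 0 to |A|,
    # take all k-element combinations.
    elems = list(A)
    return [set(c) for k in range(len(elems) + 1)
            for c in combinations(elems, k)]

# P({1, 2}) = {{}, {1}, {2}, {1, 2}}, which has 2**2 = 4 members.
subsets = power_set({1, 2})
```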
&lt;p>Cantor&amp;rsquo;s Theorem is quite compact. It says that $|P(A)| > |A|$; in English, the number of subsets of $A$ is
strictly greater than the size of $A$. But based on what we just showed, shouldn&amp;rsquo;t this be obvious? For any
$n \geq 0$, $2^n > n$, after all. What is so revolutionary or helpful about Cantor&amp;rsquo;s Theorem? As is often the
case with theorems in set theory, the subtlety stems from having to extend the result to infinite sets.&lt;/p>
&lt;p>The beauty of Cantor&amp;rsquo;s Theorem is that with one elegant argument, Cantor proves the above for any set,
finite, countably infinite, or uncountably infinite. The way he does it is by using a proof technique called
proof by contradiction. The technique consists of the following steps:&lt;/p>
&lt;ol>
&lt;li>Assume the opposite of what you want to show.&lt;/li>
&lt;li>Show that the proposition from (1) leads to some absurdity.&lt;/li>
&lt;li>Conclude that the assumption that led to the absurdity must be false, so the opposite (your original claim) must be true.&lt;/li>
&lt;/ol>
&lt;p>The proof proceeds as follows. First, we assume that $|A| = |P(A)|$. In order to handle infinite sets, this
means that we are assuming that each element $a \in A$ can be matched with exactly one element $S_a$ of
$P(A)$ (a subset of $A$) with none left over. For each element-subset pair, $(a, S_a)$, either $a \in S_a$ or
$a \notin S_a$. Now consider the set of $b \in A$ for which $b \notin S_b$. That is, consider the set of all
elements that are not contained in the sets they map to; because this group of elements is a subset of $A$,
we can refer to it as $S_c$ for some $c \in A$. We now have to answer a simple question: Is $c \in S_c$?&lt;/p>
&lt;p>If $c \in S_c$, then $c$ is an element that is not in the set it maps to, namely $S_c$&amp;hellip; which is absurd.
But then surely, $c \notin S_c$&amp;hellip; right? If $c \notin S_c$, though, it means that $c$ is an element that is
in the set it maps to, namely $S_c$. But if it is in $S_c$, we get into the same pickle we were in in the
first case. Thus, $c$ cannot be in $S_c$ and $c$ cannot not be in $S_c$. We have thus reached our desired
contradiction! Our assumption, that $|A| = |P(A)|$, must be false.&lt;/p>
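&lt;p>The diagonal construction can be made concrete with a small sketch (my own, on a finite set, where we already know from counting that no matching can exist). Whatever matching $f$ we try, the set of elements not contained in their own image disagrees with every $f(a)$ about $a$ itself:&lt;/p>

```python
def diagonal_set(A, f):
    # Collect the elements of A that are not members of the
    # subset they are matched with (Cantor's diagonal set).
    return {a for a in A if a not in f(a)}

A = {1, 2, 3}
# One attempted matching of elements to subsets, chosen arbitrarily.
f = {1: {1, 2}, 2: set(), 3: {1, 3}}.get
D = diagonal_set(A, f)
# D differs from f(a) for every a, so this matching misses D.
assert all(D != f(a) for a in A)
```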
&lt;p>We&amp;rsquo;re not quite done, though. We&amp;rsquo;ve shown that $A$ and $P(A)$ are not the same size, but which is
bigger? Notice that matching each $a \in A$ with the single-element subset $\{a\}$ pairs every element of $A$
with a distinct member of $P(A)$, so $P(A)$ is at least as big as $A$. Since the two sizes cannot be equal,
we must have $|P(A)| > |A|$, proving the theorem.&lt;/p>
&lt;p>Cantor&amp;rsquo;s Theorem answers this post&amp;rsquo;s original question. Can you see how? It says that if you take any
infinite set $A$, its power set $P(A)$ furnishes a larger infinity. But then we get an even larger infinity
when we form $P(P(A))$, and a yet larger one when we consider $P(P(P(A)))$.&lt;/p>
&lt;p>If two infinities weren&amp;rsquo;t enough, now you&amp;rsquo;ve got as many as you like.&lt;/p></description></item><item><title>MSE = Bias² + Variance</title><link>https://www.jgindi.me/posts/2019-09-23-mse/</link><pubDate>Mon, 23 Sep 2019 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2019-09-23-mse/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In statistics, the overarching goal &amp;ndash; in some sense &amp;ndash; is to figure out how to take a limited amount of data
and reliably make inferences about the broader population we don’t have data about. To do this, we study,
develop, and use mathematical objects called statistics or estimators. Formally, these are functions of other
objects called random variables, but, for the moment, it suffices to think of them as ways of using limited
amounts of data furnished by a part to learn about the whole.&lt;/p>
&lt;p>As an example, let’s say that you wanted to find out the average height of all humans on earth, and let’s
further suppose that the actual average human height is 3.5 feet. You might take a random sample of 1000
people, add up all heights and divide by 1000. The height that you get from that procedure, the mean height
of your sample, is an estimator of the actual population height; let’s call it $A$. Alternatively, let’s say
you again took that same sample of 1000 people, disregarded it and decided that your estimator of the mean
population height was going to be zero feet, zero inches; call this (rather silly) estimator $B$.&lt;/p>
&lt;p>Before developing any formal measure of the quality of an estimator, think about the above two estimators.
Does one seem “more reasonable” than the other? For any parameter you want to estimate using some data you’ve
collected, there are infinitely many estimators you could come up with; one natural concern we might seek to
address is how to mathematically distinguish quality estimators from useless estimators.&lt;/p>
&lt;h2 id="mean-squared-error--bias--variance">Mean squared error = bias + variance&lt;/h2>
&lt;p>Mean squared error (henceforth MSE) is an attempt to formally capture the difference in the quality of
different estimators. It is defined as the expected value of the square of the distance between the
estimator’s value and the true value of the parameter you are trying to estimate with it. In the example
above:&lt;/p>
&lt;ul>
&lt;li>The true value of the parameter (average human height) is 3.5 feet.&lt;/li>
&lt;li>The estimator is the sample average of the 1000 heights that you calculated.&lt;/li>
&lt;li>Because the estimator is a function of random variables (the heights of the people you sampled), it, too, is a random
variable, say $X$, with some distribution. We can therefore think about the expected value of some function
of $X$ - in our case, the function is $f(X) = (X - 3.5)^2$. To compute the expected value, you take a weighted
average of all possible values $X$ can take, weighting each one by the probability of seeing that outcome.
(Don’t worry too much about why we&amp;rsquo;re squaring; it makes calculus easier.)&lt;/li>
&lt;/ul>
&lt;p>Mathematically, we write
&lt;/p>
$$
\text{MSE}(\hat y) = E_y((\hat y - y)^2)
$$&lt;p>
where $\hat y$ is the estimator and $y$ is the true parameter value.&lt;/p>
&lt;p>Intuitively, if we expect the estimator to, on average, stray far from the true value of the parameter, the
estimator is probably not very good. Alternatively, if that expected value is close to 0, it means that the
estimator deviates very little from the actual parameter, i.e. it’s a great estimator! With some tedious
algebra, we can actually show that&lt;/p>
$$
\begin{align}
\text{MSE}(\hat y) &amp;= E_y((\hat y - y)^2) \\
&amp;= \dots \\ &amp;= E_y((\hat y - E_y(\hat y))^2) + (E_y(\hat y) - y)^2 \\
&amp;= \text{Var}_y(\hat y) + (\text{Bias}_y(\hat y))^2
\end{align}
$$&lt;p>Bias and variance are two very important qualities of estimators that help us understand how they relate to
the real value you’re attempting to estimate:&lt;/p>
&lt;h3 id="bias">Bias&lt;/h3>
&lt;p>Bias tells you the difference between the expected value of your estimator and the actual value of the
parameter. To intuitively grasp this, imagine throwing several darts at a board, all of which strike the
board close together, but off of the true center. Your throws had high bias; in some sense, your “average”
throw’s position would have been consistent, but generally off the mark.&lt;/p>
&lt;h3 id="variance">Variance&lt;/h3>
&lt;p>Variance tells you how much the estimator tends to move around its expected value. If the estimator (just a
function) takes values spread widely around its mean, variance will be high. Suppose you know two basketball
players, both of whom average 15 points per game. One of them scores 30 one game and 0 the next, and the
other scores around 15 consistently. While both players have the same scoring average, one of their scoring
patterns is high variance &amp;ndash; i.e. deviates pretty far from the mean &amp;ndash; and the other is low variance.&lt;/p>
&lt;p>With this decomposition in mind, we can see that if we choose to use MSE as our metric of estimator quality,
it actually decomposes very nicely into two intuitively appealing sources of error. Therefore, to lower MSE,
we need to either reduce bias or reduce variance (or reduce both!). It isn&amp;rsquo;t always so simple, though, as it
is possible that by reducing one, you might have to raise the other, hence the name, the bias-variance
tradeoff. There are a bunch of techniques that are actually quite useful in practice which leverage the
decomposition of MSE into bias and variance.&lt;/p>
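&lt;p>The decomposition is easy to verify numerically. Below is a sketch (my own check, not from the original post) that simulates the two height estimators from the introduction, the sample mean ($A$) and the always-zero estimator ($B$), and confirms that empirical MSE matches empirical variance plus squared bias. The normal height distribution and the sample sizes are assumptions made purely for illustration:&lt;/p>

```python
import random
import statistics

def mse_decomposition(estimator, true_value=3.5, trials=5000, seed=0):
    # Run the estimator on many independent samples, then compare the
    # empirical MSE against empirical variance + squared bias.
    rng = random.Random(seed)
    estimates = [estimator(rng) for _ in range(trials)]
    mse = statistics.fmean((e - true_value) ** 2 for e in estimates)
    var = statistics.pvariance(estimates)
    bias = statistics.fmean(estimates) - true_value
    return mse, var + bias ** 2

def sample_mean(rng, n=100):
    # Estimator A from the height example: the mean of a random sample.
    # (Drawing heights from a normal with mean 3.5 is an assumption.)
    return statistics.fmean(rng.gauss(3.5, 0.5) for _ in range(n))

def always_zero(rng):
    # Estimator B: ignores the data and always answers zero.
    return 0.0
```

&lt;p>Estimator $B$ has zero variance but enormous bias, so its MSE is $(0 - 3.5)^2 = 12.25$; estimator $A$ has a little variance, almost no bias, and a far smaller MSE.&lt;/p>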
&lt;h2 id="in-algorithms">In algorithms&lt;/h2>
&lt;p>For example, random forests are estimators that average the opinions of a bunch of high variance (aka not
very general) decision trees to, in aggregate, function as a lower variance estimator. (This technique, a
special case of a more general class of bagging algorithms, uses the fact that the variance of an average
of $n$ independent estimators decreases as $n$ gets bigger.)&lt;/p>
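&lt;p>That last fact is simple to check empirically. Here is a small sketch (mine, with arbitrary choices of noise distribution and sample counts) that measures the variance of an average of $n$ independent noisy draws:&lt;/p>

```python
import random
import statistics

def variance_of_average(n_estimators, trials=20000, seed=0):
    # Treat each independent standard-normal draw as one noisy
    # "estimator"; average n of them, repeat the whole experiment
    # many times, and measure the variance of the averaged result.
    rng = random.Random(seed)
    averages = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_estimators))
        for _ in range(trials)
    ]
    return statistics.variance(averages)
```

&lt;p>A single draw has variance about 1; averaging 10 independent draws shrinks that to about 0.1, which is the effect bagging exploits.&lt;/p>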
&lt;p>Attacking the MSE problem from the bias side, we have boosting algorithms, wherein the decision trees (or any
other base estimator) are trained in sequence. Each tree in the sequence trains itself more tightly to the
examples that the previous tree predicted incorrectly. In this sense, you are starting with a silly,
low-variance estimator and gradually fitting it more tightly to the data, reducing bias, so that the sequence,
all together, becomes more useful.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>While I&amp;rsquo;ve admittedly left out some of the technical details of bagging and boosting, the point is to
illustrate that something abstract-seeming like MSE can be understood in a very concrete way, and that the
decomposition we discussed can lead to practical problem-solving approaches that are actually quite
useful.&lt;/p></description></item><item><title>The weak law of large numbers</title><link>https://www.jgindi.me/posts/2018-12-31-lln/</link><pubDate>Mon, 31 Dec 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-12-31-lln/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When I say that Alice is better than Bob at chess, I’m effectively asserting that barring careless mistakes,
Alice should always beat Bob. This notion, of Alice “being better than” Bob is easy to wrap one’s head around
when the game in question doesn’t have to contend with randomness or uncertainty. What does it mean, though,
to say that Alice is better than Bob at a game like backgammon, where dice (a source of randomness) are involved? The rest of this post aims to provide some mathematical machinery to answer such a question.&lt;/p>
&lt;p>Whenever one talks about expecting some “eventual” outcome of a game or experiment, he or she is actually
invoking a fundamental statistical law: the Law of Large Numbers (henceforth LLN). The LLN is one of a group
of fundamental results in probability theory known as limit theorems, which are statements about how certain
aspects of random variables stabilize as you make more and more observations of them. In plain English (we’ll
get to its technical formulation a bit later), the LLN (and we’ll see that there are actually two
variants) says that if you have an experiment whose average outcome is a number $m$, then as you try the
experiment more and more times, the average value of your collection of outcomes will tend to $m$.&lt;/p>
&lt;h2 id="examples">Examples&lt;/h2>
&lt;p>To get a feel for what’s going on here, let’s look at a few examples that demonstrate the LLN’s ubiquity and
importance.&lt;/p>
&lt;ul>
&lt;li>When you play blackjack against the house, its ability to make money hinges on a critical assumption:
(without cheating) even using a probabilistically optimal strategy, the chance that you win a hand is
less than 50%. The house edge might be very very small, but as long as it has some nonzero edge, the house
stands to make money in the long run because it is playing lots and lots of hands. This assumption relies
on the LLN.&lt;/li>
&lt;li>Let’s imagine that a basketball player is in a shooting slump. He regularly shoots around 40% from
outside the three point line, but of late he’s only managed to connect on 20% of his attempts.
Encouragement offered to such a player typically looks like “Don’t worry, just keep shooting, your shot
will come back!” A coach who lifts his player’s spirits this way is also relying on the LLN.&lt;/li>
&lt;li>To return to our backgammon example, when we say that Alice is better than Bob, what we’re saying is
that in any individual game, it’s possible that Bob beats Alice, but if they were to play 100 or
1000 or 1000000 games, Alice would end up winning a majority of them. The more games they
play, the more obvious the advantage would become.&lt;/li>
&lt;/ul>
&lt;h2 id="formal-statement">Formal statement&lt;/h2>
&lt;p>As I’ve noted in many other posts, it is one thing to have a good idea, another to formalize it
mathematically. The concept underlying the LLN is one that most of us intuitively grasp without understanding
statistics. But in order to prove it and then use it as a building block to understand other, more subtle
results, we need to be able to state it formally.&lt;/p>
&lt;p>Before we do, I want to note that there are actually two well known versions of the LLN. We will concern
ourselves here with the Weak Law of Large Numbers (the Strong Law is harder to prove but says something
similar in spirit). Mathematically, we would state it as follows (don’t worry &amp;ndash; we’ll break each piece of
this down momentarily): Let $X_1, \dots, X_n$ be independent and identically distributed random
variables all having finite average value $m$. Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any
$\varepsilon > 0$, $P(|\bar X_n - m| &lt; \varepsilon) \to 1$ as $n \to \infty$.&lt;/p>
&lt;p>Let’s dissect this piece by piece and make sure it makes sense:&lt;/p>
&lt;ul>
&lt;li>“Let $X_1, \dots, X_n$ be independent and identically distributed random variables”: This means that each
observation $X_i$ is independent of all the others and is produced by the same distribution as all the
others. Think of a black box that independently spits out a sequence of $n$ numbers using some fixed, unknown
probability distribution. Each number the box spits out would be represented by an $X_i$. (The theorem is
going to say what we can infer about the black box distribution’s average value once we’ve
made a sufficiently large number of observations.)&lt;/li>
&lt;li>“Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$”: $\bar X_n$ is the average value of the observations.&lt;/li>
&lt;li>“Then for any $\varepsilon > 0$”: This is a math trick students usually first come across in real
analysis. You basically pick some arbitrarily small value of $\varepsilon$ and then show that some quantity
can always be made smaller than the value you chose. (In our case, that value will be the difference
between the average value of the observations and the true average $m$.)&lt;/li>
&lt;li>“$P(|\bar X_n - m| &lt; \varepsilon) \to 1$ as $n \to \infty$”: As you take more and more observations, it
becomes more and more certain that the average value of the observations and the true mean of the underlying
distribution differ by less than $\varepsilon$.&lt;/li>
&lt;/ul>
&lt;p>Read that again if it didn’t make sense. Once all that sinks in, go back and read the formulation; I hope
you find that it’s a delightfully compact way of formalizing the intuition we started with.
Our final act will be to prove it. If you haven’t heard of Chebyshev’s inequality, read
&lt;a href="https://www.quora.com/q/iqjtzyaumcdpirqz/Probabilistic-Ballparking">this&lt;/a> before attempting the proof.&lt;/p>
&lt;h2 id="proof">Proof&lt;/h2>
&lt;p>We will assume for this proof that the $X_i$ have finite variance $\sigma^2$, though this
assumption is not necessary. (It’s just that if we don’t make that assumption, the proof gets more
complicated.)
First, let’s compute the mean and variance of $\bar X_n$. Because expectation is linear, and $\bar X_n$ is
just $\frac{1}{n}$ times a sum of RVs each having mean $m$, we have&lt;/p>
$$
\begin{align*}
E(\bar X_n) &amp;= \frac{1}{n}\sum_{i=1}^n E(X_i) \\\\
&amp;= \frac{1}{n}\sum_{i=1}^n m \\\\
&amp;= mn/n \\\\
&amp;= m.
\end{align*}
$$&lt;p>To compute the variance, we use the independence of the $X_i$ to write that&lt;/p>
$$
\begin{align*}
\text{Var}(\bar X_n) &amp;= \text{Var}(\frac{1}{n}\sum X_i) \\\\
&amp;= \frac{1}{n^2}\text{Var}(\sum X_i) \\\\
&amp;= \frac{1}{n^2}\sum \text{Var}(X_i) \\\\
&amp;= \frac{1}{n^2}n\sigma^2 \\\\
&amp;= \sigma^2/n.
\end{align*}
$$&lt;p>Before we continue, note that the fact that the variance is a decreasing function of $n$ makes intuitive sense:
the more observations I take, the smaller the variance of their average should be.&lt;/p>
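&lt;p>The $\sigma^2/n$ scaling is easy to check empirically. In this Python sketch (my own, not part of the derivation), a single fair die roll has variance $\sigma^2 = 35/12 \approx 2.92$, so the variance of the average of 10 rolls should be about $0.29$:&lt;/p>

```python
import random
import statistics

random.seed(1)

def empirical_var_of_mean(n, trials=2000):
    """Estimate Var(Xbar_n) from many independent sample means of n die rolls."""
    means = [sum(random.randint(1, 6) for _ in range(n)) / n for _ in range(trials)]
    return statistics.pvariance(means)

sigma2 = 35 / 12  # variance of a single fair die roll
print(empirical_var_of_mean(10), sigma2 / 10)  # the two numbers should be close
```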
&lt;p>Next, we use Chebyshev’s inequality with $\mu = m$, $k = \varepsilon$ and $X = \bar X_n$ to say that&lt;/p>
$$
\begin{align*}
P(|\bar X_n - m| &lt; \varepsilon) &amp;= 1 - P(|\bar X_n - m| \geq \varepsilon) \\\\
&amp;\geq 1 - \frac{\sigma^2/n}{\varepsilon^2} \\\\
&amp;= 1 - \frac{\sigma^2}{n\varepsilon^2}.
\end{align*}
$$&lt;p>(The inequality sign is flipped because we have that $1 - \dots$ in there. Otherwise, it’s just plug and play Chebyshev.)&lt;/p>
&lt;p>As $n$ gets larger and larger, that rightmost term tends to 0, so the lower bound tends to 1. Since a probability can never exceed 1, the probability of interest must itself tend to 1, and voila! We’re done.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This law is intuitive, deep &lt;em>and&lt;/em> deeply embedded in the way that we as human beings deal with uncertainty as we navigate the world, so I thought it deserved a post.&lt;/p>
&lt;p>Happy New Year to all!&lt;/p></description></item><item><title>Bounding probabilities with Markov and Chebyshev</title><link>https://www.jgindi.me/posts/2018-11-04-markov-chebyshev/</link><pubDate>Sun, 04 Nov 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-11-04-markov-chebyshev/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Very often, finding exact answers is a pain; ballpark estimates usually suffice. When you’re nervous to board
a plane, you don’t care to calculate the exact probability that the plane will land safely; you only care
that it’s over $99.99\%$. When you’re trying to figure out how long a project will take your team at work,
you use approximations throughout your calculation because there are far too many unknowns and variables to
compute the exact answer. Will a member of your team get sick and have to take time off? How quickly will
your team members learn a new technology? How productive will the new hire be? Across computer science, there
are myriad problems whose solutions are, in a rigorous sense, intractable to calculate exactly. To solve
such problems — they show up all over a variety of industries — we design approximation algorithms that
sacrifice some optimality and/or deterministic correctness for gains in efficiency and simplicity.&lt;/p>
&lt;h2 id="ballparking">Ballparking&lt;/h2>
&lt;p>In probability, one particular set of heuristic techniques we use is a set of inequalities called probability
bounds. You would use them for some of the same reasons described above: intractability of an exact
probability calculation; lack of need for an exact answer; generality (they don’t make many complex, esoteric
or restrictive assumptions, so they apply to lots of different problems). In this post, I want to state and
prove a few probability bounds and show how you can apply them. While the estimates might not always give
you something useful to work with, it’s good to be aware of how to use the bounds should the opportunity
present itself.&lt;/p>
&lt;h3 id="example">Example&lt;/h3>
&lt;p>As we often do when probability is concerned, let’s think about a sequence of coin flips. Instead of the usual unbiased coin, though, let’s say the coin falls heads with probability $p=1/3$ (and tails with probability $1 - p = 2/3$). In a sequence of 100 flips, what is the probability that greater than or equal to half of them fall heads?&lt;/p>
&lt;p>The natural distribution to use to model this question is the binomial distribution. If we let the random variable $X =$ # of heads in 100 flips, we would say that $X \sim \text{Bin}(100, 1/3)$ ($X$ represents the number of times a head will fall in 100 flips when the probability of heads on each independent toss is 1/3). In the case of this particular problem, we can calculate the exact probability of 50 or more heads. The formula that computes the exact solution looks like this (don’t worry about what the symbols mean — not important, but if you’re interested, take a look at &lt;a href="https://en.wikipedia.org/wiki/Binomial_distribution">this&lt;/a>):&lt;/p>
$$\sum_{k=50}^{100} {100 \choose k}(\frac{1}{3})^k(\frac{2}{3})^{100-k} \approx 0.00042.$$&lt;p>Without a bunch of cleverness or a computer, it would be very hard, if not impossible, to carry out that computation by hand. Let’s see if those probability bounds I talked about can lend us a hand. Before we do, we need to introduce two facts about binomial random variables. If $X \sim \text{Bin}(n, p)$, the expected value (average value) of $X$, written $E(X)$, is equal to $np$ (the number of trials times the individual probability of success), and the variance of $X$ (a measure of how much $X$ deviates from its average value), written $\text{Var}(X)$, is equal to $np(1-p)$. With that in mind, follow me.&lt;/p>
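&lt;p>(A quick aside: while that sum is hopeless by hand, it’s a one-liner on a computer. This Python sketch, my own and not part of the original argument, checks the $\approx 0.00042$ figure:)&lt;/p>

```python
from math import comb

# Exact tail probability P(X >= 50) for X ~ Bin(100, 1/3),
# summing the binomial pmf over k = 50, ..., 100.
p_tail = sum(comb(100, k) * (1 / 3) ** k * (2 / 3) ** (100 - k) for k in range(50, 101))
print(p_tail)
```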
&lt;h2 id="markovs-inequality">Markov&amp;rsquo;s inequality&lt;/h2>
&lt;p>The first probability bound we’re going to look at is known as &lt;em>Markov’s inequality&lt;/em>. Before we technically state it, in plain English, Markov’s inequality tells us about the probability of a random variable exceeding some value. Formally, if $X$ is a positive random variable (the outcomes of your experiments are strictly positive values), then for any real number $a > 0$,&lt;/p>
$$P(X \geq a) \leq \frac{E(X)}{a}.$$&lt;p>Before we prove this, I want to explicitly state that the primary advantage of Markov’s inequality is its generality. It usually doesn’t furnish the most useful bounds, but notice that in order to apply Markov, all we need to know is that $X$ is a positive random variable. As we will see below, it is also an important building block with which we will come up with better probability bounds. Before we see what it would tell us about the problem we started with, let’s prove it.&lt;/p>
&lt;h3 id="proof">Proof&lt;/h3>
&lt;p>Let the random variable $I = 1$ if $X \geq a$ and $0$ otherwise. The proof is mostly complete when you notice this convenient, but somewhat subtle fact:&lt;/p>
$$I \leq \frac{X}{a}.$$&lt;p>Why? Let’s think about it. If $I = 1$, then, by definition, $X \geq a$. Dividing by $a$ on both sides of $X \geq a$ gives us $\frac{X}{a} \geq 1 = I$. If $I = 0$, the above inequality holds because both $X$ and $a$
are positive. Taking the expectation of both sides (don’t worry — expectation doesn’t change the sign of the
inequality), the above inequality turns into $E(I) \leq E(\frac{X}{a})$. We can pull $\frac{1}{a}$ out of the
right hand term because you can pull constants out of expectation. Further, by the definition of expected
value, $E(I) = 1 \times P(I = 1) + 0 \times P(I = 0) = P(I = 1)$ is just the probability that $X \geq a$ (by
our definition of $I$), so the above expression can be rewritten as&lt;/p>
$$P(X \geq a) \leq \frac{E(X)}{a},$$&lt;p>so we’ve finished the proof.&lt;/p>
&lt;h3 id="a-bound-on-the-example-using-markov">A bound on the example using Markov&lt;/h3>
&lt;p>Now that we know that Markov’s inequality holds, let’s see what it can tell us about our problem. Recall that
we defined $X$ to be the number of flips that fall heads in a sequence of $100$ tosses. In particular, we
wanted to know what the odds were that there were more than or equal to $50$ heads, i.e. the probability that
$X \geq 50$. Noting that the expected value of our particular $X$ (with $n = 100$ and $p = 1/3$) is $np = 100
\times 1/3$, we can plug these numbers into Markov and see that&lt;/p>
$$P(X \geq 50) \leq \frac{100 \times \frac{1}{3}}{50} \approx 0.67.$$&lt;p>This tells us that the probability that we see at least $50$ heads is at most $0.67$. Not so
useful, but do you see how simple that was to compute? Given the exact answer we computed above, this
estimate doesn’t tell us very much, but the computation was so easy.&lt;/p>
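&lt;p>For the record, here is that plug-and-play computation as a couple of lines of Python (my own sketch):&lt;/p>

```python
# Markov: P(X >= a) <= E(X) / a, with E(X) = n * p = 100/3 and a = 50
n, p, a = 100, 1 / 3, 50
markov_bound = (n * p) / a
print(markov_bound)  # about 0.667
```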
&lt;h2 id="chebyshevs-inequality">Chebyshev&amp;rsquo;s inequality&lt;/h2>
&lt;p>The next bound I want to look at is called Chebyshev’s inequality. While it looks and feels a bit different
from Markov, its spirit is similar. Chebyshev is a statement about the probability that some random variable
deviates from its average by a certain amount. Its formal statement is: if $X$ is a random variable with
finite expected value $\mu$ and finite variance $\sigma^2$ (those two conditions hold enough of the time),
then&lt;/p>
$$P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.$$&lt;p>In English, the probability that $X$ deviates from its mean by at least $k$ is at most $X$’s variance divided by $k^2$.
We will find this bound a bit more useful when we apply it to our problem, but before we do that, we need to prove it.&lt;/p>
&lt;h3 id="proof-1">Proof&lt;/h3>
&lt;p>If $X$ is a random variable, then $X - \mu$ is a random variable. Furthermore, if $X - \mu$ is a random variable,
then so is $(X - \mu)^2$. Now we apply Markov with $X$ replaced by $(X - \mu)^2$ and $a$ replaced by $k^2$, and we have:&lt;/p>
$$P((X - \mu)^2 \geq k^2) \leq \frac{E((X - \mu)^2)}{k^2}.$$&lt;p>The expression in the parentheses on the left side is equivalent to the expression $|X - \mu| \geq k$ (take the square
root of both sides). Replacing what’s in parentheses with the equivalent formulation and recognizing the numerator on
the right side as the very definition of variance completes the proof because we can rewrite the above as&lt;/p>
$$P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.$$&lt;p>That proof might have actually been mechanically easier than Markov! In addition, Chebyshev’s inequality is usually more
informative and almost as general as Markov, so let’s take a look at what it can tell us about our problem.&lt;/p>
&lt;h3 id="a-bound-on-the-example-using-chebyshev">A bound on the example using Chebyshev&lt;/h3>
&lt;p>Recall that $\mu = np$ in our problem is approximately $33.3$ and that the variance is $\sigma^2 = np(1-p) = 100 \times 1/
3 \times 2/3 = 200/9 \approx 22.2$. The information we’re missing here is what value of $k$ to choose. In this case, we
want to know something about the probability that there are more than or equal to $50$ heads. $50 - 33.3 = 16.7$ so if we
set $k$ to $16.7$, we will be able to apply Chebyshev to know an upper bound on the probability that $X$ takes a value
that is more than $16.7$ away from $\mu$ in either direction. More directly, we would be able to say something about the
probability that $X$ takes a value less than or equal to $33.3 - 16.7 = 16.6$ or greater than or equal to $33.3 + 16.7 =
50$. The probability that $X$ is less than or equal to $16.6$ is superfluous, but it doesn’t affect the correctness of our
upper bound. By Chebyshev, we can conclude that&lt;/p>
$$P(|X - \mu| \geq 16.7) \leq \frac{22.2}{16.7^2} \approx 0.079,$$&lt;p>so our upper bound is now around $8\%$. Given the result of the explicit calculation we computed first, this isn’t especially tight either, but it’s much better than Markov and it was almost as easy to compute!&lt;/p>
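&lt;p>In code (my own sketch), using the unrounded values $\mu = 100/3$ and $k = 50 - 100/3$, the bound comes out to exactly $0.08$; the $0.079$ above is an artifact of rounding $\mu$ and $k$ first:&lt;/p>

```python
# Chebyshev: P(|X - mu| >= k) <= sigma^2 / k^2 for X ~ Bin(100, 1/3)
n, p = 100, 1 / 3
mu = n * p                # 100/3, about 33.33
sigma2 = n * p * (1 - p)  # 200/9, about 22.2
k = 50 - mu               # chosen so that mu + k = 50
cheb_bound = sigma2 / k ** 2
print(cheb_bound)  # 0.08 (up to floating-point noise)
```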
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>There are other, more powerful bounds that I won’t go into here because they’re more involved and this post is already long, but hopefully you’ve gotten an idea about where these sorts of things might be useful.&lt;/p></description></item><item><title>A different way of thinking about eigenvalues</title><link>https://www.jgindi.me/posts/2018-10-11-eigvals/</link><pubDate>Thu, 18 Oct 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-10-11-eigvals/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This post’s title is intentionally vague. Usually, I write an introduction that describes the path the post will walk, then I meander down that path from beginning to end, and I finish with a conclusion that sums it all up. After thinking about how to write this post in the most engaging way, it occurred to me that mathematics has often felt the most satisfying to me when I’ve felt as though I was discovering definitions and theorems rather than having them read to me by professors. My hope is that in addition to conveying the beauty and ingenuity behind what follows, I am able to also pass along some of the wonder that I myself felt during the journey. With that, follow me&amp;hellip;&lt;/p>
&lt;h2 id="review-of-linear-operators">Review of linear operators&lt;/h2>
&lt;p>Suppose we have a vector space $V$. One of the most important kinds of objects (if not &lt;em>the&lt;/em> most important) we study in linear algebra is the structure-preserving map called a linear transformation (a.k.a. a linear map). As a quick review, a transformation is (loosely) a function that takes a vector from one vector space and turns it into a vector in another using some rule. In order for a transformation $T$ to be called &lt;em>linear&lt;/em>, we need the following two properties to hold:&lt;/p>
&lt;ol>
&lt;li>If $u$ and $v$ are vectors, it must be the case that $T(u + v) = T(u) + T(v)$.&lt;/li>
&lt;li>If $v$ is a vector and $c$ is a member of the $V$’s underlying field, it must be the case that $T(cv) = cT(v)$.&lt;/li>
&lt;/ol>
&lt;p>In English, (1) and (2) say that linear transformations must preserve addition and scalar multiplication &amp;ndash; that is, adding/scaling then mapping must produce the same result as mapping then adding/scaling. For this post, we’re going to focus our attention on linear &lt;em>operators&lt;/em>, or linear maps from $V$ to itself.&lt;/p>
&lt;h2 id="invariant-subspaces">Invariant subspaces&lt;/h2>
&lt;p>So suppose $T$ is an operator on $V$. It’s natural to wonder how $T$ behaves with respect to subspaces of $V$. In the case of an operator, it will always be the case that $T$ maps a subspace $U$ to some subspace of $V$, but &lt;em>how&lt;/em> $T$ transforms an arbitrary choice of subspace is unclear. Let’s simplify a little bit and think about the neatest-possible set of outputs that $T$ might produce as it acts on vectors from $U$; we are going to focus on subspaces which, with respect to $T$, are in some sense self-contained. In other words, let’s require that $T$ map every vector in $U$ back into $U$. (A more compact way of phrasing our requirement is that we want $T$ to be an operator on $U$.) If $T$ behaves this way with respect to $U$, we say that $U$ is &lt;em>invariant&lt;/em> under $T$. (In technical jargon, we say a subspace $U$ is invariant under $T$ if for every vector $u \in U$, $T(u) \in U$.)&lt;/p>
&lt;p>I don’t know about you, but I still don’t feel on sure enough footing; let’s simplify further. Instead of letting our invariant subspaces get big and complicated, let’s restrict our focus to $T$’s invariant subspaces of the lowest possible dimension. As a brief aside, the dimension of a vector space is the smallest number of vectors, linear combinations of which can comprise an entire space. For example, consider $\mathbf{R}^2$ (the $(x,y)$ plane). It is a vector space of dimension 2. Why? Because I can make the two vectors $(0,1)$ and $(1,0)$ into any vector (coordinate pair) that I want! How? Notice that $(a,b) = a \times (1,0) + b \times (0,1)$. In the $\mathbf{R}^2$ example, we call $\{(0,1), (1,0)\}$ a basis (“a” basis because there are infinitely many others). A vector space&amp;rsquo;s dimension is defined as the size of any basis (all bases are the same size).&lt;/p>
&lt;h2 id="eigenvalues">Eigenvalues&lt;/h2>
&lt;p>In light of our detour, what might a one-dimensional invariant subspace look like? Well, a one dimensional subspace has a basis of size one, which means that the subspace is made up of linear combinations of a single vector, i.e. its scalar multiples! In math terms &amp;ndash; and in this case, I actually think the symbols help &amp;ndash; a one dimensional subspace $U$ looks like
&lt;/p>
$$U = \{ au ~|~ a \in F \},$$&lt;p>
where $F$ is $V$’s underlying field.&lt;/p>
&lt;p>Now let’s say this low-dimensional subspace is invariant under $T$. This would mean that $T(u)$ lands back in $U$, and
given that $U$ is of dimension one, $T$ must send $u$ to a scalar multiple of itself. In other words, there is $\lambda \in
F$ such that $T(u) = \lambda u$.&lt;/p>
&lt;p>As you might have guessed by now, $\lambda$ is what we call an eigenvalue and $u$ is its corresponding eigenvector.&lt;/p>
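&lt;p>To make this concrete, here’s a tiny Python sketch (the matrix is a hypothetical example of my own). The operator below sends $u = (1, 1)$ to $(3, 3) = 3u$, so the line spanned by $(1, 1)$ is a one-dimensional invariant subspace, with eigenvalue $\lambda = 3$ and eigenvector $u$:&lt;/p>

```python
# A is the matrix of an operator T on R^2 (example chosen for illustration)
A = [[2, 1],
     [1, 2]]

def apply(A, v):
    """Matrix-vector product: the action of T on the vector v."""
    return [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]

u = [1, 1]
print(apply(A, u))  # [3, 3], i.e. 3 * u
```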
&lt;p>One of the central focuses of linear algebra is understanding the relationship between linear transformations
(useful abstractions) and their matrices (computational tools). Eigenvalues and eigenvectors (loosely) allow you to write
down computationally friendlier matrices corresponding to linear transformations. (The friendliest of these is known as a
diagonal matrix &amp;ndash; the only nonzero entries are those along the diagonal stretching from the top left to bottom right
corners. You can write down a diagonal matrix if and only if you manage to find a basis of eigenvectors.)
Eigenvalues and eigenvectors are typically not motivated particularly well. For a while, I trusted my professors that they
were important and useful, but I’d really never seen a way in which they arise naturally.&lt;/p>
&lt;p>Typically, eigenvalues are presented as the roots of the characteristic polynomial of an operator. What’s the characteristic
polynomial? If $A$ is the matrix corresponding to $T$ (huh?), then the characteristic polynomial is given by $p(\lambda) =
\det(A - \lambda I)$. The roots of $p$ are $T$’s eigenvalues. Wait, but what is that $\det$ symbol? Determinants, you say?
What are those? How do I know that $\det(A - \lambda I)$ is a polynomial? To know that, you’d need to know how to unpack all
of those symbols, which requires understanding what they are… and before you know it, you’re so far down so many
rabbit holes that you stop thinking and just start accepting. A few days later, your professor moves on to matrix
diagonalization and in a haze of all of the other things going on in your life, you’ve memorized a totally opaque technique
that you’ve applied correctly &lt;em>just&lt;/em> enough times to feel like you understand. This, I believe, is one of the sneakiest and
most pervasive ways that beautifully intuitive math manages to pass students by.&lt;/p>
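&lt;p>That said, in the $2 \times 2$ case the recipe is small enough to unpack completely: for a matrix with entries $a, b, c, d$, the characteristic polynomial works out to $\lambda^2 - (a+d)\lambda + (ad - bc)$, and the quadratic formula hands you the eigenvalues. A Python sketch (my own example matrix):&lt;/p>

```python
from math import sqrt

# For a 2x2 matrix [[a, b], [c, d]], det(A - lambda*I) expands to
# lambda^2 - (a + d)*lambda + (a*d - b*c).
a, b, c, d = 2.0, 1.0, 1.0, 2.0
trace, det = a + d, a * d - b * c

# Roots of the characteristic polynomial, via the quadratic formula
# (this assumes real eigenvalues, i.e. trace**2 >= 4*det).
disc = sqrt(trace ** 2 - 4 * det)
eigvals = sorted([(trace - disc) / 2, (trace + disc) / 2])
print(eigvals)  # [1.0, 3.0] for this matrix
```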
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Don’t fall victim! When you run up against a concept you don’t understand, keep struggling with it; don’t get lulled into
indifference because you can compute a correct answer. Ask questions; no question is too small. From experience, I can tell
you that there is an elegance waiting for you beyond the struggle. One so deep, fundamental and profound that you’ll be
truly glad you stuck around.&lt;/p></description></item><item><title>Counting chord intersections: two approaches</title><link>https://www.jgindi.me/posts/2018-09-16-two-approaches/</link><pubDate>Sun, 16 Sep 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-09-16-two-approaches/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In non-mathematical disciplines, it is very often the case that approaching the same question in different ways will lead you to different conclusions. One of the beautiful qualities of math and the type of reasoning it requires is that for any given problem, there might be (and usually is) a plethora of different approaches. The discovery of a new approach adds perspective about the problem and allows the solver’s understanding of a particular area to deepen and expand.
In helping my cousin with some homework recently (she figured out one of the solutions below before I did), I came across a wonderful example of the way different approaches inform each other. One is far more intuitive than the other, but I didn’t realize there was a more intuitive solution until after solving it in a more complex, algebraic way.
Without further ado, let’s begin.&lt;/p>
&lt;h2 id="the-problem">The problem&lt;/h2>
&lt;p>Let’s say you’ve got a class of 5 students, and you have them arrange themselves as follows. You tell them to stand in a circle and you give each pair of them a string to hold tight between them. If people are dots and the strings between them are the line segments, the arrangement might look like this:&lt;/p>
&lt;img src="https://www.jgindi.me/posts/two-approaches/k5.png" alt="drawing" width="300" height="300"/>
&lt;p>The question we want to answer is: &lt;strong>How many intersections are there in the middle of the circle?&lt;/strong>
For the case when there are 5 students, the answer, by inspection of the picture above, is 5 (the vertices of the pentagon in the middle). For the remainder of this post, we will concern ourselves with the more general version of this question, which is: If there are $n$ people, can we find a formula &amp;ndash; in terms of $n$ &amp;ndash; that tells us the number of intersections in such an arrangement?&lt;/p>
&lt;p>The remainder of this post will meander through two approaches; the first algebraic, and the second intuitive.&lt;/p>
&lt;h2 id="algebraic-approach">Algebraic approach&lt;/h2>
&lt;p>The first thing to do to make the problem more approachable is to simplify it. In this case, we will narrow our focus to the number of intersections that the strings emanating from &lt;em>one&lt;/em> person are a participant in. If we can do that, we can multiply by the number of people there are* and divide by the number of times we overcount each intersection, and voila, problem solved.
At this point we observe two simple facts:&lt;/p>
&lt;ol>
&lt;li>There are a total of $n$ people.&lt;/li>
&lt;li>Each intersection is going to be counted 4 times. Why? Because each intersection requires two strings, each of which has two people holding the ends. Thus, once I’ve come up with my formula for the number of intersections per person, I have to remember that simply multiplying the result by $n$ is going to count each intersection 4 times, so we need to divide the total number of intersections we find by 4.&lt;/li>
&lt;/ol>
&lt;p>From (1) and (2), we can deduce that our answer is going to have the form:
&lt;/p>
$$\frac{n}{4} \cdot \text{intersections per person}$$&lt;p>The last thing we need to do is figure out the way to express the number of intersections per person. We will derive this in steps:&lt;/p>
&lt;ol>
&lt;li>Each string that crosses the middle of the circle splits the circle into two groups. For simplicity, let’s call them the right-hand and left-hand groups.&lt;/li>
&lt;li>The number of strings that intersect the splitting string is exactly the number of strings that pass from the right-hand group to the left-hand group.&lt;/li>
&lt;li>Let’s say that there are $k$ people in the right-hand group. Because there are 2 people holding the splitting string, the left group must have the remaining people, i.e. $n - k - 2$. Each of the $k$ people in the right group shares a string with each of the people in the left group, so the number of strings that cross the splitting string is given by $\text{(\# of people in right group)} \cdot \text{(\# of people in left group)} = k (n - k - 2)$.&lt;/li>
&lt;li>We now just have to treat each string that a person is holding as a splitting string, use the formula from (3) and add the results up.&lt;/li>
&lt;/ol>
&lt;p>If you understand steps (1)-(4), the rest is just boring, but unfortunately necessary algebra. If you’re new to discrete math, you should try to work this out yourself; there are some summations in there that you’ll encounter over and over again and it’s a good way to get familiar with some of them. If algebra isn’t your thing, trust me that this approach works and skip to the intuitive approach below.&lt;/p>
&lt;p>The goal now is to simplify
&lt;/p>
$$\sum_{k=1}^{n-3} k(n-k-2).$$&lt;p>
(We will multiply by $n/4$ at the end.) Our first step will be to distribute the $k$ to get:
&lt;/p>
$$\sum_{k=1}^{n-3} nk - k^2 - 2k.$$&lt;p>
From now on, I’m going to write the summation without the indices (because typing them is tedious), but they’re implied. Splitting up into three summations, we have:
&lt;/p>
$$\sum nk - \sum k^2 - \sum 2k.$$&lt;p>
Pulling out constants, this is equal to
&lt;/p>
$$n\sum k - \sum k^2 - 2 \sum k = (n-2)\sum k - \sum k^2.$$&lt;p>
We know (but if you don’t, try to prove it yourself), that $\sum_{i=1}^m i = m(m+1)/2$. Once you’ve proven that, you can use a similar method to prove that $\sum_{i=1}^m i^2 = m(m+1)(2m+1)/6$. Substituting these in and recalling that $m$ in our case is $n - 3$, we have
&lt;/p>
$$\frac{(n-2)(n-3)(n-3+1)}{2} - \frac{(n-3)(n-3+1)(2n-6+1)}{6}$$&lt;p>
Simplifying that monster,
&lt;/p>
$$
\begin{align*}
\frac{3(n-2)(n-3)(n-2) - (n-3)(n-2)(2n-5)}{6}
&amp;= \frac{(n-3)(n-2)(3n - 6 - (2n - 5))}{6} \\\\
&amp;= \frac{(n-3)(n-2)(n-1)}{6}.
\end{align*}
$$&lt;p>
Multiplying that by $n/4$, we end up with the surprisingly neat formula for the total number of intersections:
&lt;/p>
$$\frac{n(n-1)(n-2)(n-3)}{4!}.$$&lt;p>
For those who have seen this formula before, this is exactly the number of ways to choose a group of 4 people from a population of size $n$. Once I had spent all this time doing all that algebra, it occurred to me that there had to be an easier way to articulate the solution&amp;hellip;&lt;/p>
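&lt;p>If you don’t feel like re-deriving the algebra, you can let a computer confirm it. This Python sketch (my own check, not part of the original solution) verifies that the per-person sum, multiplied by $n/4$, agrees with $\frac{n(n-1)(n-2)(n-3)}{4!}$ and with the binomial coefficient ${n \choose 4}$ for a range of $n$:&lt;/p>

```python
from math import comb

def by_summation(n):
    """(n/4) * sum_{k=1}^{n-3} k * (n - k - 2), computed term by term."""
    return n * sum(k * (n - k - 2) for k in range(1, n - 2)) // 4

for n in range(4, 20):
    closed_form = n * (n - 1) * (n - 2) * (n - 3) // 24  # n(n-1)(n-2)(n-3)/4!
    assert by_summation(n) == closed_form == comb(n, 4)

print(by_summation(5))  # 5 intersections, matching the picture for five students
```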
&lt;h2 id="intuitive-approach">Intuitive approach&lt;/h2>
&lt;p>The simple way to solve the problem is to use one observation (that we actually made earlier): each intersection corresponds exactly with a single group of four people. If you draw the picture for $n = 4$, you’ll clearly see the single intersection. This observation finishes the problem for us… Can you see how?
If we know that each intersection corresponds with exactly one group of four people, that means the total number of intersections must be exactly equal to the number of groups of four people you can choose from a group of size $n$, i.e. $\frac{n(n-1)(n-2)(n-3)}{4!}$.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Neither of the above approaches is better than the other, per se. Each solution requires its own set of deductions and observations, the collection of which leave you with a richer understanding of how you might solve similar problems in the future. In this case, it didn’t even occur to me that there might be a simpler solution until I had already worked hard to find a more involved approach. Enjoy the process! The things you end up understanding the most deeply are the things you can think of and explain from a bunch of different angles.
Thanks for reading!&lt;/p>
&lt;p>*This is a nice example of a phenomenon in mathematics called symmetry. We say that an object is symmetric under some transformation $T$ if the object doesn’t change when you apply $T$ to it. In our case, notice that in the picture at the beginning of the post ($n = 5$), if you calculate the solution for any particular person $p_1$, you have solved it for all of the others &amp;ndash; just rotate the next person you want to solve it for into $p_1$’s position and notice that the problem you’re solving for the new person is of exactly the same form as the one you solved for the first person!&lt;/p></description></item><item><title>Tale of two distributions</title><link>https://www.jgindi.me/posts/2018-08-02-bin-pois/</link><pubDate>Thu, 02 Aug 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-08-02-bin-pois/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>A theme of the many posts I’ve written over the last few years is that there are deep and beautiful connections we find between apparently different areas by appealing to a little bit of formalism and some finely-tuned intuition. In this case, the two objects I will connect do not look too different from one another. In fact, they look eerily similar; it just isn’t immediately clear how to connect the dots.&lt;/p>
&lt;p>(For this post, I’m going to assume a basic understanding of random variables and probability distributions (more or less just what they are). I’ll also assume basic familiarity with permutations and combinations — just the definitions — and some basic facility with limits. It’s more than I like to assume and I’ll do my best to explain things intuitively as I go along, but if you stick with me, I think it will be well worth your while. This stuff is very, very cool.)&lt;/p>
&lt;h2 id="binomial-distribution">Binomial distribution&lt;/h2>
&lt;p>Suppose I’m flipping a fair coin. If $X$ is a random variable that takes the value 1 on heads and 0 on tails, we might encode the likelihoods of different outcomes of this experiment as $P(X = 1) = P(X = 0) = 1/2$. More generally, if I have some experiment with two possible outcomes, one of which occurs with probability $p$ (we call this event a success) and the other of which occurs with probability $1 - p$ (we call this a failure), a random variable representing the outcome of the experiment is said to be a “Bernoulli $p$” (henceforth $\text{Bern}(p)$) random variable. The random variable $X$ from the example I opened with is $\text{Bern}(1/2)$, but the general story that describes what sorts of sample spaces might hint at this underlying distribution is the same no matter what $p$ is: I have some experiment with two possible outcomes, one that occurs with probability $p$ and another that occurs with probability $1 - p$.&lt;/p>
&lt;p>Let’s make things more interesting. Suppose I repeat this Bernoulli (read: two-outcome) experiment $n$ (read: a bunch of) times, one after the next, with each trial independent of all the others, and ask you: What is the probability that $k$ (read: some number smaller than $n$) of the experiments result in a success?&lt;/p>
&lt;p>(Think about this and try to work out the answer before reading on.)&lt;/p>
&lt;p>As &lt;a href="https://www.quora.com/profile/Alon-Amit?q=alon%20amit">Alon&lt;/a> once put it, your internal monologue should go something like: “Well… Because my trials are independent of one another, I’m probably dealing with a bunch of probabilities multiplied together. The probability of one success is $p$, so the probability of $k$ successes has got to be $p^k$. The remaining $n - k$ events must be failures, so by similar logic, the probability that the rest of the experiments fail is $(1-p)^{n-k}$. Multiplying these together, I’m pretty sure the answer is $p^k(1-p)^{n-k}$.”&lt;/p>
&lt;p>Almost! There’s just one piece you’d be missing. As an example, let’s think about the different ways to get two heads in three coin flips. You can either get HHT, HTH or THH. Each of these has probability $(1/2)^2(1/2)$, &lt;em>but there are three ways&lt;/em> to achieve an outcome with that probability, so the probability of getting two heads actually ends up being $3(1/2)^2(1/2)$.&lt;/p>
&lt;p>More generally, the probability of any one ordering of experiment outcomes with $k$ successes is indeed $p^k(1-p)^{n-k}$, but to be complete, you need to count up all of the different ways to arrive at $k$ successes in a sequence of $n$ trials. The number of configurations in which $k$ of the $n$ trials are successes is ${n \choose k}$, so the probability of seeing $k$ successes in $n$ independent Bernoulli trials is ${n \choose k}p^k(1-p)^{n-k}$.&lt;/p>
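&lt;p>To make the formula concrete, here is a quick sketch in Python (the function name is mine, not from any standard library):&lt;/p>

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two heads in three fair flips: HHT, HTH, THH each occur with
# probability (1/2)^3, and there are C(3, 2) = 3 such orderings.
print(binomial_pmf(2, 3, 0.5))  # 0.375 = 3 * (1/8)
```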
&lt;p>The above story, wherein I’m trying to compute the probability of seeing a certain number of successes across a bunch of Bernoulli trials, describes what is called the binomial distribution. (An interesting note here: if you multiply out $(a + b)^n$, you’ll get $a^n + na^{n-1}b + {n \choose 2}a^{n-2}b^2 + \dots + b^n$. See if you can phrase what’s happening there in the language of permutations and combinations.) The binomial distribution is one of the most famous discrete distributions there is and it has a wide range of applications all over probability. Before explaining why I’ve led you here, we need to take a quick detour.&lt;/p>
&lt;h2 id="poisson-distribution">Poisson distribution&lt;/h2>
&lt;p>While the binomial distribution is just doing its thing, minding its own business, in a galaxy far, far away sits another distribution: the Poisson. The Poisson distribution is usually used to describe settings in which many trials occur and each trial has a very small probability of success. As an example, a Poisson random variable might represent the number of emails you received in the past hour. The likelihood that any one person emailed you during that hour is very low (low probability of success), but the number of people (trials) who could possibly email you is very high.&lt;/p>
&lt;p>Deriving the PMF (probability mass function — essentially a formula that allows you to calculate the probabilities of various events) for this is not as easy as it was to do with the binomial, but before we jump in and try to figure out what that might look like, it’s useful to note that our Poisson story bears some interesting similarity to the story we used to define the binomial. In both cases, there are a bunch of trials taking place, each of which has some probability of success and we want to know what the probability of seeing a certain number of successes is. To crystallize this fuzzy-seeming connection, and this is the critical point, &lt;em>what we are trying to capture with the Poisson story is essentially the same thing we captured when we derived the binomial, but specifically when $n$ is large and $p$ is small.&lt;/em> Can we formalize this somehow?&lt;/p>
&lt;h2 id="the-connection">The connection?&lt;/h2>
&lt;p>We can! What we’re going to do first is to define the relationship between $n$ and $p$. Above, we revealed the need to express that $p$ is small and $n$ is large. One way to do that is to enforce that $np$ be constant (there are other ways, but this is the one that will be useful here); we’ll name that constant $\lambda$. Why does this help us? Well, if I pick a value of $\lambda$, say 1, before I start, then once I choose $n$, I’ve determined $p$. As an example, if $n = 100$, then because $np = 1$, $p$ must be 1/100. Furthermore, as $n$ gets larger we see that $p$ has to get smaller to keep $\lambda$ constant, so if we let $n$ tend to $\infty$, we are implicitly forcing $p$ to tend to 0. Being able to write $p$ in terms of $n$ is going to be very helpful in what’s to come.&lt;/p>
&lt;p>With the relationship between $n$ and $p$ specified more precisely, we will now see what happens to the binomial PMF as $n$ grows. That is, we want to compute
&lt;/p>
$$\lim_{n \to \infty} {n \choose k} p^k(1 - p)^{n - k}.$$&lt;p>
Because $p = \lambda / n$ we can rewrite all the $p$s in the above limit as $\frac{\lambda}{n}$s:
&lt;/p>
$$\lim_{n \to \infty} {n \choose k} (\frac{\lambda}{n})^k(1 - \frac{\lambda}{n})^{n - k}.$$&lt;p>
Simplifying a little bit, the limit can be rewritten as
&lt;/p>
$$\lim_{n \to \infty} \frac{n!}{k!(n-k)!} \frac{\lambda^k}{n^k}(1 - \frac{\lambda}{n})^{n - k} = \lim_{n \to \infty} \frac{n(n-1)(n-2)\dots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot (1 - \frac{\lambda}{n})^{n - k}.$$&lt;p>
We will look at each term of the product one at a time. Provided that their limits all exist, we can glue them together.
For the leftmost multiplicand, noting that $\frac{n-m}{n}$ for $m$ constant tends to 1 as $n$ tends to $\infty$, we observe that we have $k$ numerator-denominator pairs that all tend to 1. Multiplied together this shows us that the leftmost term tends to 1.
We can pull $\frac{\lambda^k}{k!}$ out of the limit because $k$ is constant. Combined with our analysis of the first term, we still need to solve
&lt;/p>
$$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^{n - k}.$$&lt;p>
Because $k$ is constant and $n$ is shooting off to $\infty$, $n-k$ behaves the same as $n$ in our case, so we can rewrite the limit as
&lt;/p>
$$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^n.$$&lt;p>
This limit is a thinly veiled version of a common exercise from standard first semester calculus courses. I challenge you to convince yourself that
&lt;/p>
$$\lim_{n\to \infty} (1 + \frac{a}{n})^{bn} = e^{ab}.$$&lt;p>
(Hint: Start by setting $y = \lim_{n\to \infty} (1 + \frac{a}{n})^{bn}$ and taking the natural log of both sides.)&lt;/p>
&lt;p>In our case, $a = -\lambda$ and $b = 1$, so our last piece evaluates to $e^{-\lambda}$. Gluing our pieces together, we have the PMF of the Poisson distribution: if $X$ is a Poisson random variable, $P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$. (Check Wikipedia, I dare you.)&lt;/p>
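&lt;p>We can also watch this convergence happen numerically. The sketch below (function names are mine) holds $\lambda = np$ fixed at 1 and lets $n$ grow:&lt;/p>

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k arrivals with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

lam, k = 1.0, 3
for n in (10, 100, 10_000):
    print(n, binomial_pmf(k, n, lam / n))  # creeps toward the Poisson value
print("Poisson:", poisson_pmf(k, lam))     # e^(-1) / 3! ≈ 0.06131
```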
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>When I saw the above, I thought it was a really nice use of mathematical formalism to ground a connection we first noticed more intuitively. I find that it is often in finding footing for these sorts of beautiful connections that math shines brightest.&lt;/p>
&lt;p>(This post was inspired by &lt;a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein&lt;/a>’s &lt;a href="https://www.youtube.com/watch?v=KbB0FjPg0mw">Stat110&lt;/a> course on YouTube. Your course is awesome and I’ve learned a ton. Thank you.)&lt;/p></description></item><item><title>The birthday problem</title><link>https://www.jgindi.me/posts/2018-05-28-birthday/</link><pubDate>Mon, 28 May 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-05-28-birthday/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>After writing a &lt;a href="https://www.jgindi.me/posts/monty-hall/">post&lt;/a> about the Monty Hall problem the other day, a friend of
mine asked if I’d write one about another famous, counterintuitive probability problem known as the birthday problem.
The problem asks a simple(-seeming) question: determine the smallest number of people who must be in a room in order for
there to be a 50% chance that two of them share a birthday. (Assume every year has 365 days.)&lt;/p>
&lt;p>People typically hear the words “fifty percent” and “birthday” and think something along the lines of: “If there are 365
possible birthdays, then I probably would need about 365/2 people. That’s… **stares up and to the right for a second**… 183
people!” These very people are usually shocked to hear that the solution is far smaller than that, and the rest of this
post will show how to calculate it using some probability.&lt;/p>
&lt;h2 id="using-probability-theory">Using probability theory&lt;/h2>
&lt;p>While the following observation is obvious, it unlocks a new way of solving a whole trove of probability problems: when we
consider an event $A$, it either happens or it doesn’t. (We refer to the “it doesn’t” event as $A^C$, read “$A$
complement”.) Thus if the event $A$ happens with a certain probability $p$, then $A$ &lt;em>does not&lt;/em> happen with probability $1 - p$. This
allows us to write $P(A)$ in terms of $P(A^C)$ and vice versa. For our purposes, we encode this logic as $P(A) = 1 -
P(A^C)$.&lt;/p>
&lt;p>To solve the problem at hand, we need to come up with an expression for the probability that two people share a birthday in
a room of $k$ people. If we call the aforementioned event $A$, we want an expression for $P(A)$. As per the above, if we
can come up with an expression for $P(A^C)$, then we’ve effectively determined our expression for $P(A)$.
In our case, the event $A^C$ is the event that in a room of $k$ people, no two share a birthday. The probability of that
event is given by:
&lt;/p>
$$P(A^C) = \frac{\text{number of ways to assign unique birthdays to } k \text{ people}}{\text{number of ways to assign
birthdays to } k \text{ people}}$$&lt;p>
For the numerator, if we are looking at a room with 10 people in it, we have 365 birthdays to choose from for the
first person, 364 for the second, and so on, until we’ve assigned birthdays to all 10 people. The total number of ways is thus
$365 \cdot 364 \cdot \dots \cdot 356$.&lt;/p>
&lt;p>For a room with $k$ people, we can generalize this to $365 \cdot \dots \cdot (365 - k + 1)$, so that expression is our
numerator. To compute the denominator, we are thinking about a more relaxed version of the same assignment problem we
solved to compute the numerator. When I say more relaxed, I mean that we no longer forbid assigning the same
birthday to two people. In our room of 10 people, this means we have 365 birthday choices for the first person, but we
also have 365 for the second, third, fourth, fifth and so on. Thus, the total number of ways to assign birthdays to 10
people is $365^{10}$. For a room with $k$ people, this turns into $365^k$.&lt;/p>
&lt;p>Now that we have the numerator and denominator, we can write down $P(A^C)$:
&lt;/p>
$$P(A^C) = \frac{365 \cdot \dots \cdot (365 - k + 1)}{365^k}$$&lt;p>
With this in hand, we can now use the fact that $P(A) = 1 - P(A^C)$ to express
&lt;/p>
$$P(A) = 1 - \frac{365 \cdot \dots \cdot (365 - k + 1)}{365^k}.$$&lt;p>
Now we just start trying values of $k$ until $P(A) \geq 1/2$. This first occurs when $k = 23$.&lt;/p>
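&lt;p>A few lines of Python can do that trial-and-error for us (this is just a sketch of the formula above, with a function name of my choosing):&lt;/p>

```python
def p_shared(k):
    """1 - P(all k people have distinct birthdays)."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

# Find the smallest k with at least a 50% chance of a shared birthday.
k = 1
while p_shared(k) < 0.5:
    k += 1
print(k, round(p_shared(k), 4))  # 23 0.5073
```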
&lt;h2 id="intuition">Intuition&lt;/h2>
&lt;p>One explanation for why this number is so intuition-defyingly low is that when people first think about this problem, they
usually think that $k$ has to do with the number of people in the room, instead of noticing that the key number in this
problem is actually the number of &lt;em>pairs&lt;/em> of people in the room. While 23 people doesn’t seem like much, 23 people
furnishes you with &lt;em>253 pairs&lt;/em>! Each pair has probability 364/365 of &lt;em>not&lt;/em> having the same birthday, so, treating the pairs as roughly independent, the probability
that none of the 253 pairs shares a birthday is
&lt;/p>
$$(\frac{364}{365})^{253} \approx 0.4995.$$&lt;p>
This means that the probability that some pair shares a birthday is approximately $1 - 0.4995 = 0.5005$.
To see that the number of pairs is what matters, note that if we increase $k$ to 50, the probability that some pair shares
a birthday is roughly 97%. By the time you have 75 people at your party, the probability jumps to 99.9%.&lt;/p>
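&lt;p>The pair-counting heuristic is easy to check too. This sketch treats the $\binom{k}{2}$ pairs as independent (they aren’t quite, which is why it’s only an approximation):&lt;/p>

```python
from math import comb

def approx_p_shared(k):
    """Approximate P(some pair shares a birthday) via C(k, 2) 'independent' pairs."""
    return 1 - (364 / 365) ** comb(k, 2)

for k in (23, 50, 75):
    print(k, round(approx_p_shared(k), 4))  # 23 → ≈0.5005, 50 → ≈0.97, 75 → ≈0.9995
```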
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>As was the case with the Monty Hall problem, the birthday problem serves as an example of the way that a rigorous analysis
is a great way to combat our sometimes errant intuition.&lt;/p></description></item><item><title>The Monty Hall paradox</title><link>https://www.jgindi.me/posts/2018-05-24-monty-hall/</link><pubDate>Thu, 24 May 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-05-24-monty-hall/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The field of probability is rife with counterintuitive results that show how necessary the rigor of mathematics is to
correct understanding of certain situations. This post will be about the Monty Hall Problem. It isn’t hard to state, but
the result is somewhat subtle, so I thought it’d be fun to write about.&lt;/p>
&lt;h2 id="the-problem">The problem&lt;/h2>
&lt;p>The parlor-trick version of the problem goes as follows: You are on a game show and in front of you are three doors
(labeled 1, 2 and 3). Two conceal goats and one hides a car. The car has a 1/3 probability of initially being behind each
door. Your host, Monty, knows which door the car is behind. If you pick the door with the car behind it, you win. Monty
asks you to select a door and you choose door 1. Monty then opens one of the two remaining doors (door 3, say), revealing a
goat, and then asks you if you want to change your selection to door 2. Does it pay to take him up on his offer?
Think about what you would do before continuing to read. What is your intuition telling you?&lt;/p>
&lt;p>At first, many people are ambivalent. They argue that because there are two doors left, it’s equally likely that the car is
behind door 1 as it is behind door 2. Thus, switching neither hurts nor helps. This argument isn’t quite correct, though;
it doesn’t make use of the information that Monty provided you by opening one of the doors! It turns out that some
knowledge of conditional probability would greatly increase your chances of going home with that new car. Let’s see why.&lt;/p>
&lt;h2 id="law-of-total-probability">Law of total probability&lt;/h2>
&lt;p>Let’s see what a probabilistic argument tells us. From the fact that Monty opened door 3, we know that the car has to be
behind door 1 or door 2. Let $S$ represent the event that the switching strategy wins the game. I think the simplest
argument is the one that makes use of the Law of Total Probability (LoTP), which we will simplify to:
Given a partition of a sample space (a set of events that are disjoint from one another and together make up the whole
sample space) $B, B^C$, $P(A)$ (for $A$ in the same probability space as $B$) can be written as $P(A) = P(A|B)P(B) +
P(A|B^C)P(B^C)$ (Note: $B^C$ is the complement of $B$. $B^C$ occurs if $B$ does not).
(The above can be naturally generalized to an infinite partition of the sample space.)
For an intuitive idea of what this means, this picture is helpful:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/monty-hall/lotp.png" alt="Law of total probability">&lt;/p>
&lt;p>As you can see, part of $A$ intersects with $B$ and the other part intersects with $B^C$. The LoTP basically tells us that
one way of computing $P(A)$ is to add up the probabilities that A occurred given that either (1) $B$ occurred, or (2) $B$
did not occur (i.e. $B^C$ occurred). Because $B$ either occurred or didn’t, this sum has to give the total probability of
$A$. We are going to apply similar logic to our problem.&lt;/p>
&lt;p>As is obvious, at the outset, the car is either behind door 1, door 2 or door 3. Let $D_i$ be the event that the car is
behind door $i$. Because the $D_i$ are a partition of the sample space, I can use the LoTP, so that
&lt;/p>
$$P(S) = P(S|D_1)P(D_1) + P(S|D_2)P(D_2) + P(S|D_3)P(D_3)$$&lt;p>
Because at the beginning of the game the car had an equal likelihood of being behind any of the three doors, we can fill in the
right multiplicands in the sum above:
&lt;/p>
$$P(S) = P(S|D_1)\cdot \frac{1}{3} + P(S|D_2)\cdot \frac{1}{3} + P(S|D_3)\cdot \frac{1}{3}$$&lt;p>
Next, observe that $P(S|D_1)$ is the probability that switching wins given that the car is behind the door you initially
chose. This probability is 0, because you would be switching away from the winning door. If the car is behind door 2 (note
that in this case, door 3 was opened and is no longer in contention), switching always gets you the car, i.e. $P(S|D_2) =
1$. The same is true if the car is behind door 3 (and door 2 was opened). Hence
&lt;/p>
$$P(S) = 0 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} = \frac{2}{3}.$$&lt;p>
In other words, switching wins the game for you with probability $2/3$!&lt;/p>
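&lt;p>A quick simulation is a nice sanity check. The sketch below leans on the observation that, since Monty always reveals a goat from the unchosen doors, switching wins exactly when your first pick was wrong:&lt;/p>

```python
import random

def switch_wins():
    car = random.randrange(3)   # door hiding the car
    pick = random.randrange(3)  # your initial choice
    # Monty opens a goat door you didn't pick; switching then takes the
    # one remaining closed door, which hides the car iff your pick was wrong.
    return pick != car

random.seed(0)  # seed chosen arbitrarily, for reproducibility
trials = 100_000
print(sum(switch_wins() for _ in range(trials)) / trials)  # ≈ 2/3
```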
&lt;h2 id="intuition">Intuition&lt;/h2>
&lt;p>Let’s see if we can understand this a bit better without appealing to symbols. When you first chose door 1, you had a $1/3$
chance of winning the car. Stated a different way, you had a $2/3$ chance of &lt;em>not&lt;/em> getting the car. When Monty opens one of
the unchosen doors and reveals a goat, he is, in effect, providing you with new information. The probability of your
initial choice being correct is still $1/3$, but with your updated understanding of the world, the $2/3$ probability that
the car is not behind your initial choice of door is all resting on the single remaining door.&lt;/p>
&lt;p>For a more extreme application of this line of reasoning, consider a similar problem in which you start with 100 doors
to choose from. You choose door 1 and Monty opens 98 of the remaining doors, revealing goats. Originally, your choice held
a $99/100$ probability of being incorrect. Initially, that $99/100$ was spread evenly over the 99 doors you didn’t select.
With each door Monty opened, that $99/100$ was condensed to fewer and fewer doors, so that by the very end, there is a
single door holding all of that probability that you were initially wrong. You’d be crazy not to switch.*&lt;/p>
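&lt;p>The 100-door version simulates just as easily, this time with Monty’s goat-revealing step made explicit (the door counts are the ones from the story above):&lt;/p>

```python
import random

def switch_wins(n, m):
    """One round with n doors where Monty opens m goat doors you didn't pick."""
    car = random.randrange(n)
    pick = random.randrange(n)
    goats = [d for d in range(n) if d != car and d != pick]
    opened = set(random.sample(goats, m))
    closed = [d for d in range(n) if d != pick and d not in opened]
    return random.choice(closed) == car  # switch to a random remaining door

random.seed(1)  # seed chosen arbitrarily, for reproducibility
trials = 100_000
print(sum(switch_wins(100, 98) for _ in range(trials)) / trials)  # ≈ 99/100
```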
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The Monty Hall paradox is a fun, relatable problem that is a terrific example of the need for formality in thinking about
probabilities. Our intuition is a powerful tool, but for as many problems as it allows us to solve, it yet more often leads
us astray. In such moments it is crucial to have a systematic, rigorous approach with which to check ourselves and make
sure that we remain on sound logical footing.&lt;/p>
&lt;p>*For the more symbolically inclined reader, I want to quickly use the logic from the simple cases to show that it in fact
always pays to switch doors. Suppose we generalize the problem so that there are $n\geq 3$ doors and after you select one
of them, Monty opens $m$ of the remaining doors ($1 \leq m \leq n - 2$) and then asks you if you want to switch. In this
case, after opening $m$ doors, the $(n-1)/n$ probability that you were initially wrong gets evenly distributed across the
$n - 1 - m$ remaining doors, so that each door carries a probability of $(n-1)/(n(n - 1 - m))$. We complete the argument by
verifying that this quantity is always bigger than $1/n$.
To do this, note that showing
$1/n &lt; (n-1)/(n(n - 1 - m))$
is the same as showing
$1 &lt; (n-1)/(n - 1 - m)$, which is clear.&lt;/p></description></item><item><title>The mean value theorem</title><link>https://www.jgindi.me/posts/2018-05-22-mvt/</link><pubDate>Tue, 22 May 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-05-22-mvt/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Derivatives are used across many different fields of engineering, physics and mathematics to analyze the ways that
continuous quantities change. Although the definition of derivative that we talk about in calculus courses today works
well, it wasn’t always so simple. Coming up with a way to talk about derivatives that we could both understand
intuitively and use to prove useful results took centuries and a bunch of
mathematical legwork that I thought would be worth exposing a small part of.
&lt;p>In particular, in this post I want to define derivatives and lead up to the proof of the Mean Value Theorem. It and its
generalizations are some of the most important and useful results about derivatives that we have and I thought that proving
the most elementary version would be accessible and somewhat fun!&lt;/p>
&lt;p>(Some small amount of calculus is assumed, but it isn’t strictly necessary.)&lt;/p>
&lt;p>Without further ado, here we go!&lt;/p>
&lt;h2 id="defining-the-derivative">Defining the derivative&lt;/h2>
&lt;p>The first thing we need to do is come up with a rigorous and useful definition of a derivative. To help draw this intuitive
picture, imagine the following. You have two points on the $x$-axis that are pretty close together. We’ll call one point $x$
and the other $c = x + \text{a little bit}$. If our curve is given by $f(x)$, we can denote the slope of the line through
$(x,f(x))$ and $(c, f(c))$ as the change in $y$ values divided by the change in $x$ values (aka rise over run). In other
words, the slope $m$ of the line would be given by
&lt;/p>
$$m = \frac{f(c) - f(x)}{c - x}.$$&lt;p>
This isn’t anything new. You’ve known how to calculate slopes since middle school. The question of slope becomes more difficult, however, when you try to modify the usual notion of slope of a line through &lt;em>two&lt;/em> points to come up with an analogous description for the slope of tangent line to a curve at &lt;em>one&lt;/em> point.&lt;/p>
&lt;p>(Note: I will assume for the remainder of this post that you have some familiarity with limits and continuity. If you
don’t, you might want to browse the web and review those briefly. Even if you don’t have the requisite background, I’ll
do my best to explain the concepts in a non-technical, intuitive way.)&lt;/p>
&lt;p>To do this, we have to use some stuff from introductory calculus. Intuitively, what we’re going to do is let “a little bit”
get smaller and smaller, so that $x$ and $c$ get closer and closer together. As the distance between them gets smaller and
smaller, the slope of the secant line through $(x,f(x))$ and $(c,f(c))$ approximates the slope of the tangent line at $x$.
Put succinctly, &lt;strong>the slope of the tangent line of a curve at a point $c$ is the limit of the slopes of secant lines
between $(c,f(c))$ and points $(x, f(x))$ where $x$ gets arbitrarily close to $c$.&lt;/strong> Put mathematically
&lt;/p>
$$f’(c) := \lim_{x \to c} \frac{f(c) - f(x)}{c - x}.$$&lt;p>
We say that the function $f$ is differentiable at $c$ if the limit above exists. This idea of differentiability at a point
can be extended naturally to a set of points.&lt;/p>
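&lt;p>You can watch the secant slopes settle down numerically. The sketch below uses $f(x) = x^2$ at $c = 3$ (an example of my choosing), where the tangent slope should be 6:&lt;/p>

```python
def secant_slope(f, c, x):
    """Slope of the line through (c, f(c)) and (x, f(x))."""
    return (f(c) - f(x)) / (c - x)

f = lambda x: x**2
c = 3.0
# For f(x) = x^2 the secant slope at distance h works out to exactly 6 + h.
for h in (1.0, 0.1, 0.001, 1e-6):
    print(h, secant_slope(f, c, c + h))  # 7.0, 6.1, 6.001, ... -> 6
```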
&lt;p>In a typical real analysis course, after laying definitions down, you get familiar with the definitions by proving some of
the usual facts about derivatives. These are things like $(f + g)’(c) = f’(c) + g’(c), (kf)’(c) = kf’(c)$ ($k$ is a
constant and $f, g$ are differentiable at $c$). I’ll leave proving these and other facts like the power, product,
quotient and chain rules as an exercise for the inclined reader in favor of venturing into more interesting territory.&lt;/p>
&lt;p>In particular, I want to lead up to a proof of the mean value theorem, one of the most fundamental and useful facts about
derivatives. It’s one of those theorems that looks sort of obvious when you just draw some pictures (a la intermediate
value theorem), but it’s actually a pretty deep result. It and its generalizations are some of the most useful tools
mathematicians have had at their disposal to tackle derivatives since their inception in the 17th century. We’ll get to
our main result via a few intermediate ones, starting with the interior extremum theorem (IET).&lt;/p>
&lt;h2 id="the-interior-extremum-theorem">The interior extremum theorem&lt;/h2>
&lt;p>The interior extremum theorem tells us something about the connection between a function&amp;rsquo;s extreme values and points at which the derivative vanishes:&lt;/p>
&lt;blockquote>
&lt;p>Interior Extremum Theorem: Let $f$ be differentiable on the open interval $(a,b)$. Then if $f$ attains a maximum (or a
minimum) on $(a,b)$ at some point $c$, then $f’(c) = 0$.&lt;/p>&lt;/blockquote>
&lt;p>&lt;strong>Proof:&lt;/strong> Because $c$ is in the open interval $(a,b)$, we can find two sequences, $x_n$ and $y_n$, such that both
converge to $c$ and $x_n &lt; c &lt; y_n$ for all $n$. Given these sequences, we have
&lt;/p>
$$f’(c) = \lim_{n \to \infty} \frac{f(c) - f(x_n)}{c - x_n} \geq 0,$$&lt;p>
because $f(c) - f(x_n)$ is nonnegative (because $f(c)$ is a maximum value of $f$ on $(a,b)$) and $c - x_n$ is positive for
all $n$ (because $x_n &lt; c$). On the other hand, we also see that
&lt;/p>
$$f’(c) = \lim_{n \to \infty} \frac{f(c) - f(y_n)}{c - y_n} \leq 0,$$&lt;p>
because the numerator is again nonnegative, but this time the denominator is negative for all $n$. Thus, we have $0 \leq
f’(c) \leq 0, $ so $ f’(c) = 0$. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;p>Note that the converse is not necessarily true. It isn’t necessarily the case that if the derivative is 0 at a point $c$,
then $f(c)$ is a maximum or a minimum of $f$ on $(a,b)$. Consider the function $f(x) = x^3$ on the interval $(-1,1)$. By the
power rule, $f’(x)=3x^2$ (and it is defined on $(-1,1)$). At $x = 0$, $f’(x) = 0$, but $-1 = f(-1) &lt; f(0) &lt; f(1) = 1$,
whence $f$ takes neither a minimum nor a maximum value at 0 even though the derivative vanishes there. Although its
converse doesn’t hold, the IET does furnish us with a pretty powerful computational tool with which to solve optimization
problems, the solutions to which often begin with “First take the derivative of the function you want to optimize and find
the $x$ values at which it equals 0…”.&lt;/p>
&lt;h2 id="rolles-theorem">Rolle&amp;rsquo;s theorem&lt;/h2>
&lt;p>Next, I want to use the IET to prove another result that isn’t so hard to convince yourself of with pictures. It’s called
Rolle’s theorem. It tells us that if $f$ takes the same value at the two ends of an interval and is differentiable on said
interval, then there must be some point within the interval at which the derivative is zero. I actually think it’s
worthwhile to draw some pictures and convince yourself intuitively that this result makes sense before you read the proof
below:&lt;/p>
&lt;blockquote>
&lt;p>Rolle&amp;rsquo;s Theorem: Let $f$ be continuous on $[a,b]$ and differentiable on $(a,b)$ with $f(a) = f(b)$. Then there is a
point $c \in (a,b)$ such that $f’(c) = 0$.&lt;/p>&lt;/blockquote>
&lt;p>&lt;strong>Proof:&lt;/strong> If $f$ is constant on $(a,b)$, then $f’$ is identically zero on that interval, in which case there’s nothing to
prove. If $f$ is non-constant on the interval, then because $f$ is continuous on $[a,b]$, the extreme value theorem tells us that $f$ attains a maximum and a minimum there; since $f(a) = f(b)$ and $f$ is non-constant, at least one of these extrema must occur at a point inside $(a,b)$. Applying the IET at that point, we’re done. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;p>The constant case above is simple. In the non-constant case, what we’re basically saying is that starting at $f(a)$, the
value of $f$ rises (or falls) as the $x$ values move to the right. At a certain point, though, they need to start falling
back down (rising back up) to the value $f$ took at $a$. The point at which this fall (rise) happens is the point we
sought.&lt;/p>
&lt;h2 id="mean-value-theorem">Mean value theorem&lt;/h2>
&lt;p>With these results, we are now equipped to prove the mean value theorem. We will do this by reducing it to a simple
application of Rolle’s theorem.&lt;/p>
&lt;blockquote>
&lt;p>Mean Value Theorem: Let $f$ be continuous on $[a,b]$ and differentiable on $(a,b)$. There is a point $c \in (a,b)$ such
that $f’(c) = \frac{f(a) - f(b)}{a - b}$.&lt;/p>&lt;/blockquote>
&lt;p>(Note that this is a more general version of Rolle’s theorem. Draw some pictures and convince yourself of the theorem
before reading the proof below.)&lt;/p>
&lt;p>&lt;strong>Proof:&lt;/strong> Let’s first write down what the equation of the line through $(a, f(a))$ and $(b, f(b))$ would look like. On the one hand, we know that the slope of the line is $\frac{f(a) - f(b)}{a - b}$. On the other, because we are talking about a line, the slope through any point $(x, y(x))$ on it and $(a, y(a)) = (a, f(a))$ must be the same, so we have
$\frac{f(a) - f(b)}{a - b} = \frac{y(x) - f(a)}{x - a}.$
We can get a general equation of the line by solving for $y(x)$ (multiply both sides by $x - a$ and then add $f(a)$ to both sides). Doing so, we see that
&lt;/p>
$$y(x) = \frac{f(a) - f(b)}{a - b}(x - a) + f(a).$$&lt;p>
In order to transform this into an instance of Rolle’s theorem, what we’re going to do is build a new function $d$ that represents the differences between the curve and the line. Namely, we have
&lt;/p>
$$d(x) = f(x) - y(x) = f(x) - \biggr[ \frac{f(a) - f(b)}{a - b}(x - a) + f(a) \biggr].$$&lt;p>
Clearly, $d(a) = d(b) = 0$. By the rules about combining continuous functions, $d(x)$ is continuous on $[a,b]$ and by rules for combining differentiable functions, $d(x)$ is differentiable on $(a,b)$. This means that we can apply Rolle’s Theorem to $d(x)$ so that we have a $c \in (a,b)$ such that $d’(c) = 0$. That is, there is a $c$ such that $d’(c) = f’(c) - \frac{f(a) - f(b)}{a - b} = 0$. Rearranging a bit, we see that at that $c$, we have $f’(c) = \frac{f(a) - f(b)}{a - b}$, proving the theorem. &lt;strong>QED.&lt;/strong>&lt;/p>
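&lt;p>To see the theorem in action numerically, here is a small sketch (the example function is mine): it grid-searches $(a,b)$ for a point where $f’$ matches the secant slope, using $f(x) = x^3$ on $[0, 2]$, where the point works out to $c = 2/\sqrt{3}$:&lt;/p>

```python
def mvt_point(f, fprime, a, b, steps=100_000):
    """Grid-search (a, b) for a c where f'(c) is closest to the secant slope."""
    target = (f(a) - f(b)) / (a - b)
    grid = (a + (b - a) * i / steps for i in range(1, steps))
    return min(grid, key=lambda c: abs(fprime(c) - target)), target

# Secant slope of x^3 over [0, 2] is 4; 3c^2 = 4 gives c = 2/sqrt(3).
c, slope = mvt_point(lambda x: x**3, lambda x: 3 * x**2, 0.0, 2.0)
print(round(c, 4), slope)  # 1.1547 4.0
```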
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Several of the important results (e.g. L’Hospital’s Rule for solving limits) that you learn about later on in calculus and real analysis courses use the Mean Value Theorem as their driving force. I haven’t seen many theorems that are both conceptually simple to grasp and fundamental building blocks of important areas of mathematics. I think the derivative is a phenomenal example of the power and necessity of good definitions and the type of ingenuity that appears all across mathematics, albeit more subtly sometimes.&lt;/p></description></item><item><title>Two puzzles from Martin Gardner</title><link>https://www.jgindi.me/posts/2018-02-21-puzzles/</link><pubDate>Wed, 21 Feb 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-02-21-puzzles/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I thought I’d write about two fun mathematical puzzles I came across recently (in Martin Gardner’s book
&lt;em>Mathematical Puzzles&lt;/em>). Neither requires much mathematical sophistication, but they are both good examples of how the
ability to think logically and model a problem technically are important problem-solving tools.&lt;/p>
&lt;h2 id="round-trip">Round trip&lt;/h2>
&lt;h3 id="problem">Problem&lt;/h3>
&lt;p>An airline runs a round trip from city A to city B and back. The plane travels at a single constant speed. During one trip,
there is no wind blowing during either leg. During a second trip, there is a constant wind that blows from A to B during
both legs of the trip. Is the first trip longer than, shorter than, or the same length as, the second trip?
(Take a minute to think about it before reading the solution.)&lt;/p>
&lt;h3 id="solution">Solution&lt;/h3>
&lt;p>Let $d$ be the distance from A to B. If $r \cdot t = d$ (rate x time = distance), then $t = d/r$. What we are going to do
is write equations for $t$ in each of the wind and no-wind cases and see if we can determine the relationship between them.
In the no-wind case, if the plane’s constant speed is given by $r$, then $t_1 = 2d / r$.
In the wind case, let’s say the wind speed is given by some $w > 0$. Then the plane travels at a rate of $r + w$ on the way
from A to B and at a rate of $r - w$ on the way back. Thus, the total round trip time is given by
&lt;/p>
$$
\begin{align*}
t_2 &amp;= \frac{d}{r + w} + \frac{d}{r - w} \\\\
&amp;= \frac{d(r - w) + d(r + w)}{r^2 - w^2} \\\\
&amp;= \frac{2dr}{r^2 - w^2}
\end{align*}
$$&lt;p>Notice that we need $w &lt; r$ for the plane to make it back at all, and because $0 &lt; w &lt; r$, we have $0 &lt; w^2 &lt; r^2$, so that
&lt;/p>
$$
t_2 = \frac{2dr}{r^2 - w^2} > \frac{2dr}{r^2} = \frac{2d}{r} = t_1,
$$&lt;p>
showing us that the trip with wind takes longer.&lt;/p>
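&lt;p>A quick numeric sanity check of this result (the values for $d$, $r$, and $w$ below are made up for illustration):&lt;/p>

```python
# Round-trip times with and without wind, using illustrative values.
d = 100.0   # one-way distance from A to B
r = 500.0   # plane's constant airspeed
w = 100.0   # wind speed, with 0 < w < r

t1 = 2 * d / r                    # no-wind round trip: 2d / r
t2 = d / (r + w) + d / (r - w)    # wind from A to B on both legs

print(t1, t2)   # 0.4 vs 0.41666...: the windy trip takes longer
assert t2 > t1
```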
&lt;h2 id="cornerless-chessboard">Cornerless chessboard&lt;/h2>
&lt;h3 id="problem-1">Problem&lt;/h3>
&lt;p>You have an 8x8 chessboard and 32 2x1 dominoes. As is hopefully clear, you can cover all 64 squares on the chessboard with the 32 dominoes. Now suppose I remove two opposite corners from the board and take away one domino. Can you cover the 62 remaining squares with your 31 remaining dominoes? If so, show how. If not, prove it.&lt;/p>
&lt;h3 id="solution-1">Solution&lt;/h3>
&lt;p>You cannot cover the remaining squares. To see why, the key observation is that each domino covers one black square and one white square. If you remove opposite corners, you are removing two squares of the same color. In order to cover what’s left, you would need to cover 30 black squares and 32 white squares, but per our observation, 31 dominoes can only cover 31 black squares and 31 white squares! Thus, covering the remaining squares with 31 dominoes is indeed impossible.&lt;/p>
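&lt;p>The coloring argument can be checked mechanically. Here is a small sketch that colors each square by the parity of row + column and counts what remains after removing two opposite corners:&lt;/p>

```python
# Count square colors after removing two opposite corners of an 8x8 board.
squares = [(row, col) for row in range(8) for col in range(8)]
color = lambda sq: (sq[0] + sq[1]) % 2   # 0 = one color, 1 = the other

removed = {(0, 0), (7, 7)}               # opposite corners share a color
remaining = [sq for sq in squares if sq not in removed]

zeros = sum(1 for sq in remaining if color(sq) == 0)
ones = sum(1 for sq in remaining if color(sq) == 1)
print(zeros, ones)  # 30 32 -- but 31 dominoes cover exactly 31 of each
```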
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>While delving into deep higher-level mathematics is certainly rewarding, it’s nice to pause every so often and have a little fun with some less-involved puzzles; I hope you’ve enjoyed :)&lt;/p></description></item><item><title>Distributed hash tables</title><link>https://www.jgindi.me/posts/2018-01-10-dht/</link><pubDate>Wed, 10 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-01-10-dht/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>While watching some lectures on distributed/cloud computing, I came across the distributed hash table, which is a way to
implement a hash table (if you’re unfamiliar, see &lt;a href="https://en.wikipedia.org/wiki/Hash_table">here&lt;/a>) distributed across a bunch of different servers
connected by a network. The goal is to implement the table so that (1) finding a key’s location is “easy” (read: efficient)
and (2) the user does not have to worry about the underlying network topology. In other words, to the client, using the
distributed table should feel like using an ordinary in-memory table on a single machine.
I chose to write about this because it’s a place where theory translates gracefully into practice. It turns out
that by using a mathematically fun and interesting hashing technique and a clever data structure in concert, we can achieve
a pretty efficient distributed hash table.&lt;/p>
&lt;h1 id="hash-tables">Hash tables&lt;/h1>
&lt;p>For those unfamiliar with the basic idea of hashing and hash tables, imagine that you have a bunch of objects that you need
to store. You have a bunch of cabinets in which to store said objects. You’d ideally like to put the items into the
cabinets in such a way that when you want to retrieve a particular item from storage, you can do so quickly. To help you
accomplish this, I give you a crystal ball which, based on some combination of the object’s characteristics, determines
which cabinet to put it in/retrieve it from. For example, if the object you’d like to store is red, spherical and
manufactured by Hasbro, your crystal ball might assign it cabinet #1 (note: a crucial property of this crystal ball is
that if it says cabinet #1 for a particular item, it will always say cabinet #1 for that item; it won’t change its
opinion).&lt;/p>
&lt;p>Then, if at some later point you want to retrieve the red ball manufactured by Hasbro, you would consult the crystal ball I
gave you to figure out where you put it. For this to work well and grant the efficiency you seek, you wouldn’t want too
many objects landing in the same cabinet; if the crystal ball assigned several items to cabinet #1, for instance, then
when your crystal ball reveals that the red Hasbro ball is in cabinet #1, you would have to look at all of the items that
landed there to find it. In the worst possible case, if every object somehow landed in cabinet #1, you would potentially
have to rifle through every item you deposited. As the number of items you need to store gets larger and larger, the
inconvenience of items in the same cabinet could range anywhere from mildly annoying to intractable.&lt;/p>
&lt;p>Before we continue, I want to attach some technical terminology to the basic aspects of hashing just discussed. In the
above example, the &lt;strong>hash table&lt;/strong> itself is the set of cabinets you have for storage. We will let the number of cabinets in our
table be denoted by $m$ and we will use $n$ to denote the number of items we have to store. The items are called &lt;strong>key&lt;/strong>s. The
crystal ball you consulted in order to know where to put/find each item is known as a &lt;strong>hash function&lt;/strong>. The easiest way to
think of a hash function is as a mathematical function that takes some object in and uses its properties to
deterministically output an integer. One of the properties we care about when we choose hash functions to suit our
applications is whether they assign lots of different keys to the same buckets. If keys $k_1$ and $k_2$ are mapped to the
same bucket by the hash function, we say that the function &lt;strong>collides&lt;/strong> on these keys.
(As a note before we get into the specifics of the hashing technique we’ll use to build our distributed table, any
algorithm that uses hashing usually relies on an assumption about hash functions (see &lt;a href="https://en.wikipedia.org/wiki/SUHA_(computer_science)">here&lt;/a>) which basically guarantees that the probability that a key lands in any particular bucket
is $1/\#\text{buckets}$. More technically, the assumption asserts that the hash function distributes keys uniformly.
We can actually show mathematically that if the hash function distributes keys uniformly, we expect any particular cabinet
to have $n/m$ items in it, implying that in the worst case, we expect that the runtime of a lookup/insert/update/delete is
$O(n/m)$. Provided that we choose $m$ close to $n$, we effectively expect that the aforementioned operations take constant
time (read: are extremely fast).)&lt;/p>
&lt;h2 id="complexities-of-a-distributed-network">Complexities of a distributed network&lt;/h2>
&lt;p>Now let’s say I have a bunch of files that I want to store across a collection of machines. For this particular example,
say I have 3 servers available labeled 0, 1, and 2. The simplest way to store the files across my servers is to take the
hash of the file I want to store (an integer), compute its remainder when you divide by 3 (note that this value can only be
0, 1 or 2) and stash the file in the server (“cabinet”) with the corresponding label. To find out which server a file $f$
has been stored on, simply compute $\text{hash}(f) \bmod 3$ (that is, the remainder that $\text{hash}(f)$ leaves when
divided by 3) and ask that server for the file.&lt;/p>
&lt;p>This technique seems all well and good, until you consider one of the realities of distributed systems: that machines
(machine will henceforth be interchangeable with “node” or “server”) join and leave the system all the time. Can you see
the challenge this presents? What happens if I stored a file on server 1, the set of servers then changes, and I ask my hash function where the file is? See
if you can spot the problem before continuing.&lt;/p>
&lt;p>The problem is that if a new node joins or leaves the system, the hash function might give me the wrong answer. Let’s say that
for some file $f$, $\text{hash}(f) = 40$. If I stored $f$ when my system only had 3 nodes, my algorithm would have placed
$f$ on node 1, because $40 / 3 = 13$ remainder $1$. Now let’s say I add a 4th node and subsequently ask where $f$ is. Well
$\text{hash}(f) = 40$, and $40 / 4 = 10$ remainder $0$! But from earlier, we saw that $f$ is actually sitting on node 1!
How might we solve this problem? Think about this before reading the next paragraph.&lt;/p>
&lt;p>We can solve this problem by rehashing every file given the new number of servers whenever the system registers a new
machine (or some machine leaves). This solution, however, is very computationally cumbersome. Each time a new node
registers (or leaves), the system has to stop serving clients, compute the new hash of every single file in the system and
move the files to their new homes. (This can be somewhat mitigated by storing metadata about files instead of the files
themselves on the servers in the system, but given enough files, even moving around all the metadata would be pretty
slow.)&lt;/p>
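&lt;p>To get a feel for how bad the churn is, here’s a small simulation (the integers stand in for hash values of files) counting how many keys change servers when we go from 3 nodes to 4 under mod-$N$ placement:&lt;/p>

```python
# Under hash(f) mod N placement, how many keys move when N goes from 3 to 4?
keys = range(10_000)                 # stand-ins for hash(file) values
moved = sum(1 for k in keys if k % 3 != k % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 75%
```

&lt;p>Roughly three quarters of the keys change homes, which is exactly the kind of mass migration consistent hashing is designed to avoid.&lt;/p>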
&lt;p>Can we avoid this pitfall somehow? The answer, as you might’ve guessed, is yes, and the technique is called consistent
hashing.&lt;/p>
&lt;h2 id="consistent-hashing">Consistent hashing&lt;/h2>
&lt;p>In consistent hashing, we pick some integer $s$ and imagine a logical ring with $2^s$ discrete slots labeled $0, 1,
2,\dots, 2^s - 1$. For example, if $s = 3$, then our logical ring would have slots labeled $0, 1, \dots, 7$ ($= 2^3 - 1$).
Ideally we choose an $s$ big enough that $2^s$ is a lot bigger than the number of nodes we expect.&lt;/p>
&lt;p>(For the remainder of these steps when I talk about locations on the ring, I’m not talking about an actual ring. If you’re
familiar with modular arithmetic, all we’re doing is wrapping hashes around a modulus. If you’re thinking of it as an
actual ring, that’ll work too.)&lt;/p>
&lt;p>First we place the servers on the ring by hashing their (IP address, port) pairs and taking remainders modulo $2^s$. If
for some server $A$, $\text{hash}(A)$ gives a remainder of $a$ modulo $2^s$, it would logically occupy slot $a$ on the
ring.&lt;/p>
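&lt;p>Here is a minimal sketch of this placement scheme, plus the basic “which node owns this key?” lookup (a key belongs to the first node at or after its slot). The server addresses are made up, and SHA-256 stands in for whatever hash function the system actually uses:&lt;/p>

```python
import bisect
import hashlib

S = 32                 # ring has 2**S slots
RING_SIZE = 2 ** S

def ring_position(name: str) -> int:
    """Hash a name (e.g. an 'ip:port' pair or a file key) onto the ring."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE

# Sorted ring positions of the servers currently in the system.
nodes = sorted(ring_position(f"10.0.0.{i}:5000") for i in range(4))

def owner(key: str) -> int:
    """The node responsible for a key: first node at or after its slot."""
    slot = ring_position(key)
    idx = bisect.bisect_left(nodes, slot)
    return nodes[idx % len(nodes)]   # wrap around past the last node

print(owner("my-file.txt"))          # ring position of the responsible node
```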
&lt;p>Next, in order to make accesses fast, we introduce a clever data structure called a finger table at each node. Each finger table maintains $s$ pointers to other nodes on the ring as follows. For $i$ between $0$ and $s - 1$ (inclusive), the $i$th finger table entry for a node at ring position $N$ is the first node whose position is greater than or equal to $(N + 2^i) \pmod {2^s}$. Let’s look at a quick example.&lt;/p>
&lt;p>Suppose that $s = 5$ and we had nodes at positions 3, 7, 16, and 27 on our logical ring. To compute 3’s finger table, we compute
&lt;/p>
$$
\begin{align*}
3 + 2^0 &amp;= 3 + 1 &amp;&amp;= 4 &amp;&amp;&amp;\pmod{32} \\\\
3 + 2^1 &amp;= 3 + 2 &amp;&amp;= 5 &amp;&amp;&amp;\pmod{32} \\\\
3 + 2^2 &amp;= 3 + 4 &amp;&amp;= 7 &amp;&amp;&amp;\pmod {32} \\\\
3 + 2^3 &amp;= 3 + 8 &amp;&amp;= 11 &amp;&amp;&amp;\pmod {32} \\\\
3 + 2^4 &amp;= 3 + 16 &amp;&amp;= 19 &amp;&amp;&amp;\pmod {32}
\end{align*}
$$&lt;p>
The first node larger than or equal to the first entry in the finger table is 7, so the first finger pointer is to 7. The same goes for the second and third entries. The fourth entry would point to 16 and the final entry would point to 27. So for node 3, the finger table would look like $(7, 7, 7, 16, 27)$.&lt;/p>
&lt;p>Each node also maintains pointers to the nodes to its right and left along the ring. The &lt;strong>successor&lt;/strong> of 3 is 7 and the &lt;strong>predecessor&lt;/strong> of 3 is 27. In a similar vein, we can more generally refer to predecessors and successors of any slot on the ring.&lt;/p>
&lt;p>To perform a lookup, we make use of both the logical ring we introduced and the per-node finger tables. Let’s say that machine 7 wants to make a query for a key that hashes to 2 (and would thus reside on node 3 — the first node to its right on the ring). It would look in its finger table for the largest node that is to the left of the key’s position — this would be node 27 — and then route the query there. Node 27 would then route the query to its successor (because it is greater than the key we’re looking for), which would then be able to send the data associated with the requested key back to 7.&lt;/p>
&lt;p>(In this example the ring is small and we don’t require that many hops. This won’t always be the case. For instance, if $s = 10$ and a query is initiated at node 3 for a key that hashed to 999, the largest node to the left of the key that 3 knows about would be at most $3 + 2^9 = 515 \pmod{1024}$ (where $1024 = 2^{10}$), but there might be a node at 997 that is much closer that 3 doesn’t have in its finger table.)&lt;/p>
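&lt;p>The worked example and the routing procedure can be put together as a runnable sketch. This is only an illustration of the scheme described above (not a full DHT implementation), using the same $s = 5$ ring with nodes at 3, 7, 16, and 27:&lt;/p>

```python
# s = 5 gives a 32-slot ring; nodes sit at positions 3, 7, 16, and 27.
S = 5
RING = 2 ** S
NODES = sorted([3, 7, 16, 27])

def successor(slot):
    """First node at or clockwise-after a given ring slot."""
    for n in NODES:
        if n >= slot:
            return n
    return NODES[0]                      # wrap around past the top of the ring

def finger_table(n):
    """The i-th finger of n points to successor((n + 2^i) mod 2^s)."""
    return [successor((n + 2 ** i) % RING) for i in range(S)]

def route(start, key_slot):
    """Greedily route a query toward the node owning key_slot."""
    target = successor(key_slot)
    path, cur = [start], start
    dist = lambda a, b: (b - a) % RING   # clockwise distance
    while cur != target:
        # largest finger that does not overshoot the key...
        cands = [f for f in finger_table(cur)
                 if f != cur and dist(cur, f) <= dist(cur, key_slot)]
        # ...otherwise the final hop goes to the key's successor
        nxt = max(cands, key=lambda f: dist(cur, f)) if cands else target
        path.append(nxt)
        cur = nxt
    return path

print(finger_table(3))  # [7, 7, 7, 16, 27], as computed above
print(route(7, 2))      # [7, 27, 3]: node 7 routes to 27, then to successor 3
```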
&lt;h2 id="why-do-this">Why do this?&lt;/h2>
&lt;p>The last things I want to briefly discuss are the benefits this whole complicated system confers. The first thing we get is relatively fast lookup time. We can actually prove that (with high probability) lookups are logarithmic. In essence, what this means is that on every hop that we take using the finger tables, we traverse at least half of the remaining distance between the current node and the node containing the desired key (the key’s successor). A sketch of the proof goes like this:&lt;/p>
&lt;p>Suppose we are at node $n$ and the node immediately to the left of a key $k$ is some node $p$. According to the algorithm outlined above, node $n$ will search its table and elect to move to the largest node it knows about that is to the left of $k$. Call this node $m$. If $m$ is the $i$th entry of $n$’s finger table, then because $p$ is necessarily at or to the left of $m$, both $m$ and $p$ are between $2^{i-1}$ and $2^i$ away from $n$. This, in turn, means that $m$ and $p$ are at most $2^{i-1}$ away from one another (because $2^i - 2^{i-1} = 2^{i-1}$). Thus, the distance between $m$ and $p$ (at most $2^{i-1}$) is at most half the distance from $n$ to $m$ (at least $2^{i-1}$), which is what we wanted.&lt;/p>
&lt;p>Using this halving result, we can note further that after $t$ hops, the number of slots we haven’t yet searched is $(\text{total number of slots}) \cdot (1/2)^t = 2^s/2^t$ (because we halved the distance between where we started and our destination $t$ times). If we make $\log n$ hops (where $n$ is the number of nodes in the system), we have $2^s/2^{\log n} = 2^s/n$ slots left to search. Because we assumed that the hash function we chose distributed nodes uniformly about the ring, we only expect there to be 1 node in this window. That there are $\log n$ such nodes has an even smaller probability! This means that after $\log n$ finger table hops, we will have to (with high probability) make at most $\log n$ more hops to get to our destination. If you’re familiar with big-Oh notation, the total runtime for a lookup in the worst case is, with high probability, $O(\log n + \log n) = O(2\log n) = O(\log n)$, as we wanted.&lt;/p>
&lt;p>As a final note, insertions, deletions, and updates are bottlenecked by the lookup operation. Once we know we can expect logarithmic lookups, we automatically know that those other operations are logarithmic too.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>For those not familiar with what logarithmic runtime means, let’s just say it’s pretty fast. The other cool thing we get out of this logical ring, which is potentially even more important when you’re talking about systems that frequently gain and lose machines, is that you don’t have to move so many of the keys around when a new node joins or leaves. For example, if a node leaves, you only have to update the finger tables of nodes that used to point to the node that left. (Can you figure out what you would have to do if a node joined?)&lt;/p>
&lt;p>I think this is a great example of the way that algorithms and mathematical reasoning are a &lt;em>huge&lt;/em> part of the push toward more scalable system architectures. If they aren’t using this exact algorithm, engineers at Google, Facebook, Amazon, Netflix and others are using similar ideas to push the boundaries of what it means for distributed systems to be available, scalable, efficient and maintainable.&lt;/p></description></item><item><title>Fundamental theorem of arithmetic</title><link>https://www.jgindi.me/posts/2017-12-2-ftoa/</link><pubDate>Wed, 10 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-12-2-ftoa/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Often when I decide to write a post about some theorem or concept, the best candidates are those that are both deep and easy to explain. These are admittedly hard to come by, but upon doing a bit of review of some basic number theory (the study of properties of whole numbers), I stumbled across the Fundamental Theorem of Arithmetic (FToA) and thought that it was an almost perfect candidate.&lt;/p>
&lt;p>The FToA is about the atomic nature of prime numbers, which, for those unfamiliar, are numbers whose only divisors are themselves and 1. The FToA basically tells us that each whole number is made up of some unique product of primes. There are proofs littered across mathematics that make use of either or both of the existence of such a decomposition and its uniqueness. For such a useful theorem, the proof is quite accessible and I thought it was worth writing about, so here we go.&lt;/p>
&lt;h2 id="proving-it">Proving it&lt;/h2>
&lt;p>The theorem can be stated as follows:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Theorem:&lt;/strong> Every positive whole number $n > 1$ can be written as a unique product of prime numbers.&lt;/p>&lt;/blockquote>
&lt;p>&lt;strong>Proof:&lt;/strong> The proof has 2 parts. We will first show that the decomposition exists and then we will show that it’s unique.
For existence, we will use induction. The base case, when $n = 2$, is trivial. 2 is the product of… well… 2. So now we
assume that $n > 2$ and that every number $1 &lt; k &lt; n$ has such a decomposition. If $n$ is prime, we’ve succeeded (same
logic we used for the base case). If $n$ is composite, then we can write $n = ab$ with $a$ and $b$ both strictly smaller
than $n$. By the induction hypothesis, both $a$ and $b$ have prime factor decompositions, so $n$ does as well.&lt;/p>
&lt;p>(If you aren’t familiar with induction, what we’ve done is shown that in the very smallest case, we have what we want.
We’ve also shown that if what we want to prove holds for $2, \dots, n-1$, it also holds for $n$. Thus, if it holds for 2, it
holds for 3. If it holds for 2 and 3, it holds for 4. If it holds for 2, 3 and 4, it holds for 5. Continuing this way
forever, we see that every possible $n$ has the property we want.)&lt;/p>
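&lt;p>The existence argument is essentially what trial division does in practice: keep splitting off the smallest prime factor until nothing is left. A small sketch (not from the original post):&lt;/p>

```python
# Trial division mirrors the existence argument: repeatedly peel off a
# prime factor until what remains is 1.
def prime_factorization(n: int) -> dict[int, int]:
    """Return {prime: exponent} for an integer n > 1."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:                      # p divides n, so record it
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:                                  # whatever remains is itself prime
        factors[n] = factors.get(n, 0) + 1
    return factors

print(prime_factorization(360))   # {2: 3, 3: 2, 5: 1}, i.e. 360 = 2^3 * 3^2 * 5
```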
&lt;p>Now, for uniqueness. The way we typically show uniqueness in math is by supposing that there are 2 distinct versions of
whatever it is we think is unique and showing that they must actually be the same. This is the technique we employ here.
Suppose that we could write $n$ in two ways. That is, suppose we could validly write both of $n = p_1^{e_1}p_2^{e_2}\dots
p_k^{e_k}$ and $n = q_1^{f_1}q_2^{f_2}\dots q_m^{f_m}$ where the $p_i$ (distinct from one another) and $q_j$ (distinct from
one another) are prime and the exponents are all positive. Notice that the first factorization has $k$ primes, the second
has $m$ and that the exponents are not necessarily all the same (yet). We are truly assuming that we have 2
factorizations that are, at least initially, potentially completely different from one another. If we can show that $k =
m$ and that $e_i = f_i$ for each $i$, then we’ve accomplished our objective.&lt;/p>
&lt;p>We can assume, without loss of generality, that the $p_i$ are in increasing order (if they’re not, we can relabel them so
that they are without affecting any part of the proof). Let’s look at $p_1$. It must divide one of the $q_j$ (this stems
from the fact that if a prime divides a product of numbers, it must divide at least one of the numbers — this is not
hard to prove using induction… try it?). Let’s say that $p_1|q_j$ for some $j$. Reorder the $q_j$ so that $p_1|q_1$.
Because both $p_1$ and $q_1$ are prime, this means that $p_1 = q_1$. So divide each factorization by $p_1$ and $q_1$
respectively and repeat this process until you run out of primes in one of the decompositions. If one of the
factorizations runs out before the other, then we will have written 1 as a product of primes greater than 1, which is
impossible. They must thus run out at the same time, whence $k = m$, $e_i = f_i$ for each $i$ and our two
factorizations must have been one and the same. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Fundamental theorems abound all over mathematics. There are fundamental theorems of arithmetic, algebra, calculus, cyclic groups, linear algebra and others. This one, though, really gets at the very makeup of a mathematical entity that all of us understand, at least on a basic level: the positive whole numbers. Cool, no?&lt;/p></description></item><item><title>Euler's Identity</title><link>https://www.jgindi.me/posts/2018-01-01-euler-identity/</link><pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-01-01-euler-identity/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I want to show a few different ways of proving that $e^{i\theta} = \cos\theta + i\sin\theta$. It’s a cute illustration of how it’s often possible and rather cool to look at and solve problems in different ways.&lt;/p>
&lt;h2 id="approach-1">Approach 1&lt;/h2>
&lt;p>The first technique is one I encountered toward the end of my tenure as a Calc 2 TA last semester as I was going over Taylor and MacLaurin series with my students. The MacLaurin series for $\sin \theta$ and $\cos\theta$ are
&lt;/p>
$$
\begin{align*}
\cos\theta &amp;= 1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \dots\\
\sin\theta &amp;= \theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \dots,
\end{align*}
$$&lt;p>
so
&lt;/p>
$$i\sin\theta = i\theta - \frac{i\theta^3}{3!} + \frac{i\theta^5}{5!} - \dots$$&lt;p>Now, let’s look at the MacLaurin series for $e^{i\theta}$. Because
&lt;/p>
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots, $$&lt;p>
we have
&lt;/p>
$$
\begin{align*}
e^{i\theta} &amp;= 1 + i\theta + \frac{i^2\theta^2}{2!} + \frac{i^3\theta^3}{3!} + \dots\\
&amp;= 1 + i\theta - \frac{\theta^2}{2!} - \frac{i\theta^3}{3!} + \dots
\end{align*}
$$&lt;p>
which, miraculously, is exactly what we get if we interleave (sum up) the terms of $\cos\theta + i\sin\theta$.&lt;/p>
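&lt;p>This interleaving is easy to check numerically. Here is a small sketch comparing a truncated MacLaurin series for $e^{i\theta}$ against $\cos\theta + i\sin\theta$ at an arbitrarily chosen angle:&lt;/p>

```python
import math

# Partial MacLaurin series of e^{i*theta} vs. cos(theta) + i*sin(theta).
theta = 0.7
series = sum((1j * theta) ** k / math.factorial(k) for k in range(30))
direct = complex(math.cos(theta), math.sin(theta))
print(abs(series - direct))  # effectively zero (floating-point noise)
```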
&lt;h2 id="approach-2">Approach 2&lt;/h2>
&lt;p>The next approach uses techniques from a first course in calculus. We first observe that if $e^{i\theta}$ is going to be the same as $\cos\theta + i\sin\theta$, then the fraction
$\frac{\cos\theta + i\sin\theta}{e^{i\theta}} = (\cos\theta + i\sin\theta)e^{-i\theta} = 1$. We will show that the second
equality holds.&lt;/p>
&lt;p>To do this, first define
$f(\theta) = (\cos\theta + i\sin\theta)e^{-i\theta}$. Next, we take the (rather annoying) derivative of $f(\theta)$
&lt;/p>
$$
\begin{align*}
f’(\theta) &amp;= e^{-i\theta}(-\sin\theta + i\cos\theta) - ie^{-i\theta}(\cos\theta + i\sin\theta)\\
&amp;= -e^{-i\theta}\sin\theta + ie^{-i\theta}\cos\theta - ie^{-i\theta}\cos\theta + e^{-i\theta}\sin\theta\\
&amp;= 0\dots
\end{align*}
$$&lt;p>
If $f’(\theta) = 0$ at all values of $\theta$, $f$ is constant! Which constant? Let’s plug in $\theta = 0$ and find out!
&lt;/p>
$$f(0) = (\cos 0 + i\sin 0) e^{-0i} = (1 + 0)(1) = 1$$&lt;p>
If $f$ takes the value 1 when $\theta = 0$ and $f$ is constant, it must take the value 1 everywhere. To sum up, we have
&lt;/p>
$$f(\theta) = (\cos\theta + i\sin\theta)e^{-i\theta} = 1.$$&lt;p>
Rearranging the second equality by cross multiplying, we see that
Euler’s identity holds.&lt;/p>
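&lt;p>We can spot-check this approach numerically too; here is a small sketch (not from the post) evaluating $f$ at a few angles to confirm it really is the constant 1:&lt;/p>

```python
import cmath
import math

# f(theta) = (cos(theta) + i*sin(theta)) * e^{-i*theta} should be constant 1,
# since its derivative vanishes everywhere.
def f(theta):
    return (math.cos(theta) + 1j * math.sin(theta)) * cmath.exp(-1j * theta)

for theta in [0.0, 0.5, 1.0, 2.0, math.pi]:
    assert abs(f(theta) - 1) < 1e-12
print("f(theta) = 1 at every sampled angle")
```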
&lt;h2 id="approach-3">Approach 3&lt;/h2>
&lt;p>The last technique is my favorite. It uses a bit of linear algebra in concert with differential equations to produce what I think is the most illuminating proof of Euler’s identity. First, consider the differential equation
&lt;/p>
$$f’’ = -f.$$&lt;p>
A differential equation is a kind of functional equation.
Rather than trying to find the value of a real-valued variable,
we are trying to find a function whose derivatives satisfy a given relationship. In this case, we want to find a function $f$ such that
$f$&amp;rsquo;s second derivative is the negative of $f$.&lt;/p>
&lt;p>We first note that $\cos\theta$ and $\sin\theta$ are both solutions to this equation:
&lt;/p>
$$
\begin{align*}
(\sin\theta)’’ &amp;= -\sin\theta \\
(\cos\theta)’’ &amp;= -\cos\theta.
\end{align*}
$$&lt;p>
Because our differential equation involves second derivatives, its
solutions are sort of analogous to those of a quadratic equation:
which is to say, there are two independent ones! Formally, we say that the solution
space is a vector space of dimension 2. If $\sin\theta$ and
$\cos\theta$ are linearly independent solutions, they form a basis
of the solution space, which means that every solution to our
differential equation can be written in the form
$a\cos\theta + b\sin\theta$ for some constants $a,b$.&lt;/p>
&lt;p>To see that $\sin\theta$ and $\cos\theta$ are indeed linearly
independent, suppose that for all $\theta \in [0,2\pi]$,
$a\cos\theta + b\sin\theta = 0$. Let’s pick a particular $\theta$
value in this interval. If $\theta = \pi/2$, then we have
$a \cdot 0 + b \cdot 1 = b = 0$ (where the $=0$ at the end is
because we supposed that $a\cos\theta + b\sin\theta = 0$ for all
$\theta \in [0,2\pi]$). If we pick another, say $\theta = 0$, we
have $a \cdot 1 + b \cdot 0 = a = 0.$&lt;/p>
&lt;p>So far, we&amp;rsquo;ve shown that if $a\cos\theta + b\sin\theta = 0$, then
we know that $a = b = 0$. This constitutes a proof that
$\sin\theta$ and $\cos\theta$ are linearly independent. Because
we said the solution space is of dimension 2, they are a basis for
the solution space to our original equation.&lt;/p>
&lt;p>We can separately observe that $e^{i\theta}$ is also a solution to our equation because
&lt;/p>
$$
\begin{align*}
(e^{i\theta})’ &amp;= ie^{i\theta}\\
(e^{i\theta})’’ &amp;= -e^{i\theta}.
\end{align*}
$$&lt;p>Because $\sin\theta$ and $\cos\theta$ form a basis, we know that
&lt;/p>
$$e^{i\theta} = a\cos\theta + b\sin\theta$$&lt;p>
for some yet unknown constants $a,b$. Now it just remains to figure out what $a$ and $b$ are. (Can you see what they should be?)&lt;/p>
&lt;p>To find $a$ and $b$, we note that if $f(\theta) = e^{i\theta}$,
then $f(0) = 1$ and $f’(0) = i$. Using the first condition, we have
&lt;/p>
$$1 = a \cdot 1 + b \cdot 0 = a.$$&lt;p>
Using the second (taking the derivative of both sides before plugging 0 in), we have
&lt;/p>
$$i = -a\sin 0 + b\cos 0 = 0 + b = b. $$&lt;p>
Putting these both together, we have
$e^{i\theta} = \cos\theta + i\sin\theta,$
which is what we wanted.&lt;/p></description></item><item><title>The Cantor set</title><link>https://www.jgindi.me/posts/2017-10-27-cantor-set/</link><pubDate>Fri, 27 Oct 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-10-27-cantor-set/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post, I want to talk about a mathematical construct I read about last night that is just downright fascinating. It’s a great example of the way math can help us make sense of the otherwise-opaque. I give you: the Cantor Set!&lt;/p>
&lt;h2 id="the-cantor-set">The Cantor set&lt;/h2>
&lt;p>Let $C_0$ be the interval $[0,1]$. Now remove the middle third of the interval to obtain $C_1 = [0, \frac{1}{3}] \cup [\frac{2}{3}, 1]$. Next, remove the middle thirds of the intervals leftover in $C_1$ to construct
$C_2 = ([0, \frac{1}{9}] \cup [\frac{2}{9}, \frac{1}{3}]) \cup ([\frac{2}{3}, \frac{7}{9}] \cup [\frac{8}{9}, 1])$.
Iteratively continue to remove middle thirds from the remaining intervals. Doing this, we get a sequence of unions of closed intervals $C_0, C_1, C_2, \dots$. The Cantor Set $C$ is defined as the intersection of the $C_i$; mathematically, we write
$C = \cap_{i=0}^\infty C_i$.
Visually, the $C_i$ look like&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/cantor-set/cantor_set.jpg" alt="">&lt;/p>
&lt;p>where the topmost line is $C_0$, the second line is $C_1$, and so on.&lt;/p>
&lt;h2 id="at-infinity">At infinity&lt;/h2>
&lt;p>The rest of this post will be spent trying to understand how $C$ is composed. What is left in it after an infinite sequence
of cuts?&lt;/p>
&lt;p>First notice that 0 and 1 are not deleted during any stage of the process. Generalizing this point, we can see that if some
value in $[0,1]$ is an endpoint of some interval at some point during our chain of cuts, it never gets removed. For
example, when we delete the middle thirds of the left and right parts of $C_1$ to obtain $C_2$, observe that we don’t touch
$0, \frac{1}{3}, \frac{2}{3}$, or $1$. Formally, we would argue that if $x$ is an endpoint of $C_i$, we know two things:&lt;/p>
&lt;ol>
&lt;li>$x \in C_k$ for $k \leq i$.&lt;/li>
&lt;li>$x$ is not removed during the construction of any of the later $C_k$ for $k > i$.&lt;/li>
&lt;/ol>
&lt;p>Thus, $x$ is in every one of the $C_i$, so by the definition of $C$ (as the intersection of the $C_i$), $x \in C$.
Is there anything else in $C$? If the only numbers left were endpoints of intervals, then $C$ would be a subset of
$\mathbb{Q}$, and we would thus conclude that $C$ is countable.&lt;/p>
&lt;p>More on this in a minute. One way we might try to convince ourselves that there is indeed not much else in $C$ besides
“endpoints” is to think about how much of the interval $[0,1]$ is left once we’ve made all of our cuts. To do this, we just
need to think about how much we delete on each pass. On the first pass, we delete one interval of length $1/3$ ($1/3^1$).
On the second, we delete 2 intervals of size $1/9$ ($1/3^2$). On the third, we delete 4 intervals of size $1/27$ ($1/
3^3$). Generalizing this pattern, we see that on the $i$th iteration, we cut $2^{i-1}$ intervals, each of size $1/3^i$. To
count up how much length we cut, we just need the sum of
&lt;/p>
$$\frac{1}{3} + 2\biggl(\frac{1}{9}\biggr) + 4\biggl(\frac{1}{27}\biggr) + \dots + 2^{i-1}\biggl(\frac{1}{3^i}\biggr) + \dots = \frac{1}{3}\sum_{i = 1}^\infty \biggl(\frac{2}{3}\biggr)^{i-1}$$&lt;p>The series is geometric with ratio less than 1, so the sum evaluates to
&lt;/p>
$$\frac{1}{3}\biggr(\frac{1}{1 - \frac{2}{3}}\biggr) = \frac{1}{3}\cdot 3 = 1.$$&lt;p>
But that’s kinda odd… we started with an interval of length 1, and have cut out… all of it? (Mathematically, we say that
$C$ has zero length.)&lt;/p>
&lt;h2 id="so-is-it-countable">So is it countable?&lt;/h2>
&lt;p>At this point, you (as I did) probably thought that the buck stopped here. As expected, $C$ is sparse and small, probably
even countable. As with many things in set theory, there is a bit more depth yet to investigate. As our final act, we’re
going to show that despite having zero length, $C$ is actually uncountable!&lt;/p>
&lt;p>To do this, we are going to take a preliminary result for granted, namely that the set of all infinite sequences of 0s and
1s is uncountable. (If you’re feeling adventurous, take a stab at proving this yourself. If you’re feeling a little less
adventurous but you’re still in the mood for a challenge, a hint is that the proof is a diagonalization argument much like
Cantor’s proof that the real numbers are uncountable.)&lt;/p>
&lt;h2 id="the-one-to-one-correspondence">The one-to-one correspondence&lt;/h2>
&lt;p>We now construct a one-to-one correspondence between sequences of 0s and 1s and elements of $C$. For each element $c \in
C$, define $a_i$ — $i \geq 1$ — to be 0 if $c$ falls in the left part of $C_i$, and 1 if it falls in the right part. (Note:
if $c$ is in the left part of $C_{i-1}$, the “left” and “right” parts of $C_i$ refer to the left and right parts that
result when we cut out the middle third of the left part of $C_{i-1}$.)&lt;/p>
&lt;p>Read the sentence in parentheses over again. It is written in unfortunately confusing language, but it’s crucial to
understanding the construction.&lt;/p>
&lt;p>To see that this is actually a one-to-one correspondence, note that given a sequence of 0s and 1s, we can “follow” the sequence to pinpoint the exact, unambiguous element of $C$ that the particular sequence represents. Conversely, the construction of the sequence from two paragraphs ago gives us a way to take an element of $C$ and come up with a unique sequence by looking at exactly where the element falls with respect to each of the $C_i$.
Given that the set of infinite sequences of 0s and 1s is uncountable, this means that $C$ is actually uncountable too!&lt;/p>
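&lt;p>One concrete way to picture the correspondence: a 0 (“go left”) contributes a ternary digit of 0, and a 1 (“go right”) contributes a ternary digit of 2. A small Python sketch (the helper name &lt;code>cantor_point&lt;/code> is my own, not standard):&lt;/p>

```python
# Map a finite prefix of a 0/1 sequence to the corresponding point of C:
# "left" contributes ternary digit 0, "right" contributes ternary digit 2.
def cantor_point(bits):
    x = 0.0
    for i, b in enumerate(bits, start=1):
        x += 2 * b / 3 ** i
    return x

print(cantor_point([0, 0, 0]))  # 0.0  (always go left)
print(cantor_point([1, 1, 1]))  # 2/3 + 2/9 + 2/27 = 26/27, about 0.963
print(cantor_point([1, 0, 1]))  # 2/3 + 2/27 = 20/27, about 0.741
```

Distinct prefixes land in disjoint pieces of the construction, which is exactly why the full (infinite) sequences pin down distinct points of $C$.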
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>When I first read this, my mind was blown. Isn’t it amazing?! We’ve somehow come up with a way to remove all of the &lt;em>length&lt;/em> from an interval without diminishing its &lt;em>size&lt;/em> in the least! By bringing a rigorous mathematical approach to our original inquiry, we took $[0,1]$, which has length 1, removed all of its length via our construction of $C$, and yet somehow didn’t affect its cardinality.&lt;/p>
&lt;p>Infinities don’t always play nice, but that’s why we love them.&lt;/p></description></item><item><title>The Alternating Series test</title><link>https://www.jgindi.me/posts/2017-09-27-alternating/</link><pubDate>Wed, 27 Sep 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-09-27-alternating/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>While I was in college, I spent a few semesters TAing Calc II (Calc BC if you do it in high school). Both when I took the class and when I TAed it, I found the part of the course devoted to infinite series the most interesting by far. It was (and still is) amazing to me that you can — informally — add together an infinite number of terms and get a finite result. In fact, until the 19th century, the above was considered paradoxical and incorrect. To help determine whether or not different series converge, mathematicians developed a suite of tests whose statements are simple enough to be taught to introductory calculus students but whose power cannot be overstated.&lt;/p>
&lt;p>In this post, we will prove that the “alternating series test” is valid. While the explanation might be a little bit involved, I hope to include all of the necessary background here. Hopefully, all you’ll need to follow this post is some patience and willingness to challenge yourself a little; this proof certainly challenged me, but I think the effort was well worth it.&lt;/p>
&lt;p>If you are unfamiliar with what an infinite series is, see &lt;a href="https://en.wikipedia.org/wiki/Series_(mathematics)">Series (mathematics) - Wikipedia&lt;/a>.&lt;/p>
&lt;h2 id="the-problem">The problem&lt;/h2>
&lt;p>Before we start, we should understand the problem we are trying to solve. Let’s say we have some infinite series, and let’s write it down as
&lt;/p>
$$\sum_{n = 1}^\infty a_n = a_1 + a_2 + a_3 + \dots.$$&lt;p>
The above series may converge or it may not. We can modify this rather plain series by adding together the terms of the
same underlying sequence with alternating signs, forming what is known as an &lt;strong>alternating series&lt;/strong>. Symbolically, an
alternating series has the form
&lt;/p>
$$\sum_{n=1}^\infty (-1)^{n-1}a_n = a_1 - a_2 + a_3 - \dots.$$&lt;p>
(Notice that the alternation comes from the fact that -1 to an even power is 1 and -1 to an odd power is -1.)
If the non-alternating series converges to a finite sum, then the alternating series clearly does as well.* If not,
though, would introducing alternation maybe force convergence? If so, we say that $\sum (-1)^{n-1}a_n$ &lt;strong>converges conditionally&lt;/strong>;
if not, oh well… some things just aren’t meant to be.&lt;/p>
&lt;h2 id="preliminaries">Preliminaries&lt;/h2>
&lt;p>More specifically, the AST is concerned with what conditions we need to place on our underlying sequence, $a_n$, so that
alternation implies convergence. It turns out that whether an alternating series converges is quite easy to check; the
only two things we need to verify are:&lt;/p>
&lt;ol>
&lt;li>The terms of the original (non-alternating) sequence are decreasing. Symbolically, we want $a_1 \geq a_2 \geq a_3
\geq \dots$&lt;/li>
&lt;li>$a_n \to 0$.&lt;/li>
&lt;/ol>
&lt;p>Our aim for the rest of this post is to prove that if $a_n$ satisfies (1) and (2), then $\sum (-1)^{n-1} a_n$ converges
to a finite sum. In order to do this, though, we need a bit of machinery from real analysis, which we discuss next.&lt;/p>
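&lt;p>As a quick sanity check of the two conditions, here’s a Python sketch using $a_n = 1/n$ (my choice of example; the resulting series is the classic alternating harmonic series, whose sum happens to be $\ln 2$):&lt;/p>

```python
import math

# a_n = 1/n is decreasing and tends to 0, so the AST predicts that
# the alternating series 1 - 1/2 + 1/3 - ... converges (to ln 2).
s = 0.0
for n in range(1, 200001):
    s += (-1) ** (n - 1) / n
print(s, math.log(2))  # the partial sum hovers near 0.6931...
```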
&lt;h2 id="completeness-and-the-monotone-convergence-theorem">Completeness and the monotone convergence theorem&lt;/h2>
&lt;p>Real analysis (in an oversimplified sense) is the study of properties of the real numbers and functions of a real
variable (things like continuity, differentiability, integrability and some other related stuff). The first thing one
often does in a real analysis class is to formally discuss what the real numbers are, why we need them, and how to
construct them. To go into all of that here would take us pretty far afield, but in short, the real numbers form the first
example of a truly continuous set in the sense that there are no holes. (Although the rational numbers are dense, there
are holes where irrational numbers — e.g. $\sqrt{2}$ — should be. Integers and natural numbers clearly have holes, e.g.
between 1 and 2.)&lt;/p>
&lt;p>One of the challenges students (myself included) typically face when discussing and trying to wrap their heads around the
above for the first time is how to rigorously define this idea that there are no holes (mathematically called
&lt;strong>completeness&lt;/strong>). Most commonly, the definition you settle on is that every nonempty set of real numbers that has an upper
bound has a &lt;em>least&lt;/em> upper bound (and similarly, every nonempty set with a lower bound has a &lt;em>greatest&lt;/em> lower bound).&lt;/p>
&lt;p>A textbook I’ve been going through actually shows that in addition to the above characterization — often referred to as
the Axiom of Completeness — there are (at least) 4 other &lt;em>equivalent&lt;/em> ways to characterize completeness, one of which we
will use in the proof that we’re going to attempt below. It’s called the Monotone Convergence Theorem (MCT), and it
states: any sequence (of real numbers) that is (1) bounded and (2) monotone converges. (A sequence is said to be
&lt;strong>bounded&lt;/strong> if there is a number $M$ such that all terms of the sequence are contained in the interval $[-M, M]$. A
sequence is &lt;strong>monotone increasing&lt;/strong> if every term is greater than or equal to the preceding term and &lt;strong>monotone decreasing&lt;/strong> if each term is less than or equal to the preceding term; if a sequence is monotone increasing or monotone
decreasing, we say it’s &lt;strong>monotone&lt;/strong>, as you might expect.)&lt;/p>
&lt;p>The MCT is pretty powerful. Boundedness and monotonicity are often intuitive properties that we can hand-wavily infer
about a sequence we are examining, and with the MCT we can transform those properties into what is oftentimes the holy grail
of sequence (and series) analysis: convergence!&lt;/p>
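&lt;p>Here’s an illustrative Python sketch of the MCT at work, using the familiar sequence $(1 + 1/n)^n$ (my choice of example, not one we’ll need below):&lt;/p>

```python
import math

# x_n = (1 + 1/n)**n is monotone increasing and bounded above (by 3, say),
# so the MCT guarantees it converges -- to e, as it turns out.
xs = [(1 + 1 / n) ** n for n in (1, 10, 100, 10000, 1000000)]
print(xs)      # 2.0, 2.5937..., 2.7048..., ... creeping upward
print(math.e)  # 2.718281828...
```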
&lt;p>Another technicality we need to address before we tackle our main result is that we need to clarify what we mean when we
write $\sum_{n=1}^\infty a_n$. What does it &lt;em>actually&lt;/em> mean for $\sum_{n=1}^\infty a_n$ to “converge” to some finite
value $S$?&lt;/p>
&lt;p>When we say that the sum converges, what we mean is that as we add more and more terms, we get closer and closer to $S$.
In other words, to know whether a series converges, we need to know whether or not what we call the sequence of &lt;strong>partial sums&lt;/strong> of $a_n$ — $a_1, a_1 + a_2, a_1 + a_2 + a_3, \dots$ — converges. Technically speaking, let $s_n$ denote the $n$th
partial sum of the $a_n$; that is, $s_n = a_1 + a_2 + \dots + a_n$. Then saying that $S = \sum_{n=1}^\infty a_n$ is the
same as saying $S = \lim_{n \to \infty} s_n$. That last sentence is just formalism; what is really important here is that
to show that some series converges to a number, we need only show that the partial sums of the sequence’s terms converge.
With this in mind, we’re ready for the main result.&lt;/p>
&lt;p>(Before continuing, make sure that you understand the main points from the previous two paragraphs.)&lt;/p>
&lt;p>Before writing a proof, I often find it helpful to have context for the way the proof is going to unfold so that as I’m writing the proof, I’m able to remember where I am and how it’s supposed to help me get where I want to go. The statement we are trying to prove here is:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Alternating Series Test:&lt;/strong> If $a_n$ is decreasing and $a_n \to 0$, then $\sum (-1)^{n-1} a_n$ converges.&lt;/p>&lt;/blockquote>
&lt;h2 id="proof-sketch">Proof sketch&lt;/h2>
&lt;p>The proof will proceed by the following steps:&lt;/p>
&lt;ol>
&lt;li>We will consider two subsequences of the sequence of partial sums: the first will be the subsequence of partial sums that add up an odd number of terms and the second will be the subsequence of partial sums that add up even numbers of terms. We will call the “odd” one $s_{2n+1}$ and the “even” one $s_{2n}$.&lt;/li>
&lt;li>We will show that each of the subsequences are bounded.&lt;/li>
&lt;li>We will show that each of the subsequences are monotone.&lt;/li>
&lt;li>(2) and (3) imply that both converge by MCT, say to limits $L_1$ and $L_2$ respectively.&lt;/li>
&lt;li>We will show that $L_1 = L_2$ and will call the shared limit $L$.&lt;/li>
&lt;li>We note that the “whole” sequence of partial sums can be made up by interleaving terms of the even and odd subsequences.&lt;/li>
&lt;li>If the two subsequences converge to L and we can interleave them to form the original sequence, then it also must converge to $L$. The original sequence was the equivalent partial sum expression of our alternating series, so the proof is complete.&lt;/li>
&lt;/ol>
&lt;h2 id="proof">Proof&lt;/h2>
&lt;p>Now, for the proof.&lt;/p>
&lt;p>The odd subsequence looks like:
&lt;/p>
$$a_1, a_1 - a_2 + a_3, a_1 - a_2 + a_3 - a_4 + a_5, \dots$$&lt;p>
Recall that the $a_n$ are decreasing. This means that subtracting $a_2$ and then adding back a little less than $a_2$ to
$a_1$ is going to leave you slightly short of $a_1$. When you subtract $a_4$ from and then add $a_5$ to $a_1 - a_2 +
a_3$, you’re going to end up slightly short of $a_1 - a_2 + a_3$. Extending this logic, we arrive at two conclusions.
First, we see that the odd partial sums are monotonically decreasing because with every pair of $a_k$ that we tack on to
obtain a successive term, we subtract some amount and then add back a bit less than we got rid of. Second, we see that
once we depart from $a_1$ (and start our adding and subtracting madness), we never quite make it back. In other words,
we can bound the odd partial sums by $a_1$.&lt;/p>
&lt;p>(There are formal, symbolic ways of representing all of this, but if you understand the above line of reasoning, you’ve
understood the gist, IMO.)&lt;/p>
&lt;p>Next, we note that for each $k$, the partial sum $s_{2k} \leq s_{2k+1}$ because the last term of every odd partial sum adds some small positive amount to the previous sum, which was made up of an even number of terms. For example, when $k = 2$, we have
&lt;/p>
$$s_4 = s_{2 \times 2} = a_1 - a_2 + a_3 - a_4 \leq a_1 - a_2 + a_3 - a_4 + a_5 = s_{2 \times 2 + 1} = s_5.$$&lt;p>
Thus, the even partial sums must be bounded too!&lt;/p>
&lt;p>Further, notice that the even partial sums monotonically increase. We can see this by observing that each time we tack on a
pair of $a_i$, we add a little bit and then subtract a little bit less. When we go from
$s_2 = a_1 - a_2$ to $s_4 = a_1 - a_2 + a_3 - a_4$, we take $a_1 - a_2$ and change it slightly by adding $a_3$ and then subtracting a little bit less than $a_3$ ($a_4$).&lt;/p>
&lt;p>But wait! We now have established that both the odd and even sums are monotone and bounded, so they must both converge!
Let’s call the limit of the sequence of odd partial sums $L_1$ and the limit of the even partial sums $L_2$. To prove
that $L_1$ and $L_2$ are the same, all we need to do is show that the difference between them is 0. To see this, we
simply observe that
&lt;/p>
$$\lim_{k \to \infty} s_{2k+1} - s_{2k} = \lim_{k \to \infty} a_{2k+1} = 0.$$&lt;p>
Thus $L_1 = L_2$; we will henceforth refer to the limit as $L$.&lt;/p>
&lt;p>We can reconstruct the sequence of partial sums representing our original series by interleaving terms of the odd and
even subsequences. Because both subsequences tend to $L$, we can conclude that the sequence constructed from the
interleaved subsequences also tends to $L$. This completes the proof, as we’ve provided the desired finite limit for our
alternating series’ partial sums.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Sometimes we’re taught formulas and tests in class and are — rather unfortunately — not challenged to understand why
they work as well as they do. As I’ve spent a little bit of time going back through some of the things I took for
granted when I first encountered them, I’ve found that the stuff behind the curtain is often very interesting and
illuminating. In this particular case, I hope you’ve found that too.&lt;/p>
&lt;p>*If adding a bunch of positive terms together gives some finite positive result (the same type of argument will work if some terms are negative), then subtracting some of the terms instead of adding them will certainly give a finite result as well. The sum of the alternating series is thus bounded above by the sum of the non-alternating terms.&lt;/p>
&lt;p>My argument against memorization has been (and will continue to be) that I don’t find topics that require lots of memorization that interesting. This is not to say that I don’t think memorization important; on the contrary, I envy those who can remember names of people and the most minute details of conversations they had. The things that I really enjoy learning, though, are the things that require a deeper understanding than that. I think that this need for fundamental understanding is one of the main things that led me to the math major as an undergrad.
As our conversation continued and we talked about the way my dad studies music, we expressed differing opinions about the value of memorizing things. As a (now former) computer science student, I started to think of and bring up the “computational” benefits and drawbacks of this approach and ran into the too familiar space/time tradeoff.&lt;/p>
&lt;p>On the one hand, being able to reproduce things saves you brain space. You could, for example, remember the essence of a particular recipe without remembering the exact particulars, and then reproduce the recipe just using the main idea — the dish is a savory version of some dish you’ve made before with a garnish of potatoes.&lt;/p>
&lt;p>On the other hand, this on-the-fly relearning could (1) take time and (2) cost you mistakes. The other option is to memorize the recipe, item for item. Using that approach, you are less prone to error and you can put the recipe together much faster because you don’t have to reason through anything.&lt;/p>
&lt;p>I realized that to me, the primary benefit of memorization is its utility as a time-saver. When information comes in a small enough package, the time you save by memorizing it once might be worth all the time you might spend reproducing it over and over later. (For those with some more technical background, I think of memorization as a caching mechanism of sorts.) Additionally, if the information is information you’re going to need to reproduce over and over, it might be worth memorizing it to save yourself the time in the long run. On the other side, if the “information packet” is unwieldy, though, and you know that you probably won’t need it that often, it might be worth “compressing” the data into its essential bits and reproducing it when you have to, so as to save yourself the headspace.&lt;/p>
&lt;p>I always appreciate when the things I study and think about at work make their ways into my daily life; this was a fun example.&lt;/p></description></item><item><title>Proving √2 is irrational</title><link>https://www.jgindi.me/posts/2017-07-26-sqrt-2/</link><pubDate>Wed, 26 Jul 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-07-26-sqrt-2/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When you first encounter number systems, the story usually goes something like this:&lt;/p>
&lt;ol>
&lt;li>There are obviously positive whole numbers. As Leopold Kronecker said: “God made the integers, the rest is the work of man.” (While it’s true that you can construct $\mathbb{N}$ set theoretically, we will take the natural numbers as given here.)&lt;/li>
&lt;li>In order to add the idea of additive inverses — which would imbue the natural numbers with richer algebraic structure — we construct $\mathbb{Z}$ (which includes the negative numbers).&lt;/li>
&lt;li>Next, we note that while multiplying members of $\mathbb{Z}$ leaves us inside $\mathbb{Z}$, division, the anti-multiplication, does not. To accommodate division and to augment $\mathbb{Z}$ to include the idea of multiplicative inverses (read: reciprocals), we construct the rational numbers (denoted $\mathbb{Q}$), i.e. the set of quotients of integers with the caveat that the denominator of said quotient must be nonzero. (Note: This is also probably the first example of an algebraic structure called a field that students encounter.)&lt;/li>
&lt;li>Then one of our professors observes that although $\mathbb{Q}$ is dense in a way that $\mathbb{Z}$ is not, it still has holes that are occupied by irrational numbers. To bring irrational numbers into the fold and finally construct a system known as the continuum (it has no holes), we construct $\mathbb{R}$, the real numbers from $\mathbb{Q}$.&lt;/li>
&lt;/ol>
&lt;h2 id="confusion">Confusion&lt;/h2>
&lt;p>During step 4, professors often introduce the irrational numbers by appealing to the mysterious $\sqrt{2}$. The question I want to tackle in this post is: what does that symbol mean? At first, I might have thought that $\sqrt{2} = 1.414\dots$ The problem with this line of reasoning is that it presupposes the number’s existence in order to tell us what it is. Another thing I might try is to argue that $\sqrt{2}$ is the limit of the following sequence of rational numbers:
$1, 14/10, 141/100, 1414/1000, \dots$,
but this explanation is fraught with the same circularity as my first attempt was.
So how might I prove that there is some real number whose square is 2 without assuming such a number exists? To do this, I will appeal to the &lt;em>completeness&lt;/em> of the real numbers. That $\mathbb{R}$ is complete means that every subset $S$ of real numbers that is bounded above has a &lt;em>least&lt;/em> upper bound $\beta$. We write this compactly as $\beta = \sup S$. (We analogously define the greatest lower bound of a set $S$ if $S$ is bounded below — we denote said lower bound as $\inf S$.) Below we will require the technical definition of a least upper bound, so I will provide its two components here:&lt;/p>
&lt;ol>
&lt;li>For $x \in S$, $\beta \geq x$.&lt;/li>
&lt;li>If $\gamma$ is an upper bound of $S$, $\beta \leq \gamma$.
Look at the above for a second and make sure they jibe with your intuition about what a least upper bound is. It’ll make the rest of this much easier to understand.&lt;/li>
&lt;/ol>
&lt;p>Now, given the completeness of $\mathbb{R}$, consider the set $S = \{t \in \mathbb{R} | t^2 &lt; 2\} \subseteq \mathbb{R}$.
It is clear that this set is bounded above by 2. By completeness, this tells us that there is some real number $\alpha =
\sup S$ that is $S$’s &lt;em>least&lt;/em> upper bound. To show that there is a real number that you can square to get 2, we just need
to show that $\alpha^2 = 2$. We will do this by ruling out the possibilities that $\alpha^2 > 2$ and that $\alpha^2 &lt; 2$.&lt;/p>
&lt;h2 id="ruling-things-out">Ruling things out&lt;/h2>
&lt;p>If $\alpha^2 &lt; 2$, then consider the number (we will discuss what $n$ to pick in a second):
&lt;/p>
$$(\alpha + \frac{1}{n})^2 = \alpha^2 + \frac{2\alpha}{n} + \frac{1}{n^2} &lt; \alpha^2 + \frac{2\alpha + 1}{n}.$$&lt;p>
What value of $n$ suits our needs here? Well, what we *would like *to show here is that $(\alpha +\frac{1}{n})^2 &lt; 2$, thus contradicting property (1) of least upper bounds that $\alpha$ must satisfy. As such, we need to pick an $n$ such that $\frac{2\alpha + 1}{n}$ fits in the gap between $\alpha^2$ and 2. Put mathematically, we want
$\frac{2\alpha + 1}{n} &lt; 2 - \alpha^2$.&lt;/p>
&lt;p>Rearranging a bit, we see that for the above to be true, we need
$\frac{1}{n} &lt; \frac{2 - \alpha^2}{2\alpha + 1}$.
Such an $n$ surely exists (for more technically inclined readers, this follows from the Archimedean property of $\mathbb{R}$), so we now would like to show that $\alpha + \frac{1}{n} \in S$. This is now simple, because with the value of $n$ we’ve chosen, we have
&lt;/p>
$$\alpha^2 + \frac{2\alpha + 1}{n} &lt; \alpha^2 + (2 - \alpha^2) = 2.$$&lt;p>
Thus, if $\alpha^2 &lt; 2$, $\alpha + \frac{1}{n} > \alpha$ is a member of $S$. This directly contradicts our supposition that $\alpha$ is the least &lt;em>upper bound&lt;/em> of $S$ — we’ve found a number larger than $\alpha$ that is in $S$.
Now, if $\alpha^2 > 2$, we proceed by a similar argument. Consider the number
$(\alpha - \frac{1}{n})^2 = \alpha^2 - \frac{2\alpha}{n} + \frac{1}{n^2} > \alpha^2 - \frac{2\alpha}{n}$.
This time, we want to show that $\alpha - \frac{1}{n}$ is an upper bound for $S$, thus contradicting the supposition that $\alpha$ is the &lt;em>least&lt;/em> upper bound of $S$. To this end, again, choose $n$ so that $\frac{2\alpha}{n}$ fits in the gap between $\alpha^2$ and 2. Mathematically, we want
$\frac{2\alpha}{n} &lt; \alpha^2 - 2$,
so we choose $n$ so that
&lt;/p>
$$\frac{1}{n} &lt; \frac{\alpha^2 - 2}{2\alpha}.$$&lt;p>
By doing this, we get that
&lt;/p>
$$(\alpha - \frac{1}{n})^2 > \alpha^2 - \frac{2\alpha}{n} > \alpha^2 - (\alpha^2 - 2) = 2$$&lt;p>
Since $(\alpha - \frac{1}{n})^2 > 2$, no element of $S$ can exceed $\alpha - \frac{1}{n}$, so $\alpha - \frac{1}{n}$ is an upper bound of $S$ that is smaller than $\alpha$. But this contradicts the supposition that $\alpha$ was $S$’s &lt;em>least&lt;/em> upper bound!
Thus, $\alpha^2 = 2$, so we can rigorously associate the symbol $\sqrt{2}$ with $\alpha$.&lt;/p>
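&lt;p>As a purely numerical illustration (not part of the proof), we can home in on $\sup S$ by bisection without ever taking a square root:&lt;/p>

```python
# A numeric sketch of the supremum argument: bisect toward the least
# upper bound of S = { t : t*t stays under 2 }.  No square roots taken.
lo, hi = 0.0, 2.0           # 2 is an upper bound of S; 0 is in S
for _ in range(60):
    mid = (lo + hi) / 2
    if mid * mid > 2:       # mid is an upper bound, so sup S is at most mid
        hi = mid
    else:                   # mid is in S, so sup S is at least mid
        lo = mid
print(hi)  # converges to 1.41421356..., the alpha with alpha**2 = 2
```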
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>I guess what I was aiming to show here is that math gives us tools to find answers to questions we don’t initially see yet are nonetheless critical to the logical soundness of what we hope to build (and sometimes have already built — yikes!) on top of them. These — often fundamental — questions are sometimes hidden behind veils of apparent obviousness. The task of securing air-tight foundations upon which mathematicians do their work is a subtle game, and I thought this was an accessible example of some of that subtlety.&lt;/p></description></item><item><title>The Basel problem</title><link>https://www.jgindi.me/posts/2017-05-26-euler-basel/</link><pubDate>Fri, 26 May 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-05-26-euler-basel/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post, I want to talk about what the Basel Problem is and how Euler solved it.
Even today, Euler remains one of the most accomplished mathematicians there ever was.
His work impacted and created a multitude of fields across mathematics: number
theory, graph theory and topology, real and complex analysis, and parts of physics.
His solution to the Basel Problem, which we will discuss below, catapulted him to
fame in 1734.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Before we attack the Basel Problem itself, I want to set the stage. Students who have
taken an introductory course in calculus are familiar with the harmonic series:
&lt;/p>
$$\sum_{n=1}^\infty \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} +
\dots.$$&lt;p>
This series is the first example students are usually given of a series that diverges
even when its individual terms tend to 0. The fact that the harmonic series diverges
was originally proved by Oresme in the 14th century, but his work was lost. About 300
years later, in the mid 17th century, the Bernoulli brothers re-proved the result.
Their success sparked their interest in the convergence/divergence of other infinite
series, one of which happened to be a natural extension of the harmonic series:
&lt;/p>
$$\sum_{n=1}^\infty \frac{1}{n^2} = 1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} +
\dots.$$&lt;p>The Basel Problem was to find the sum of this series. As a note before we see how
Euler did it, figuring out &lt;em>whether&lt;/em> an infinite sum converges is typically a &lt;em>much&lt;/em>
easier problem than &lt;em>computing&lt;/em> the sum. It’s easy to show that $\sum \frac{1}{n^2}$
converges using any one of a number of simple convergence tests (e.g. comparison,
integral), but finding the actual sum is a different matter entirely. This brings us
to our main result.&lt;/p>
&lt;h2 id="eulers-argument">Euler&amp;rsquo;s Argument&lt;/h2>
&lt;p>What we want to prove can be stated quite succinctly.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Theorem:&lt;/strong> $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}$.&lt;/p>&lt;/blockquote>
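&lt;p>Before diving into Euler’s derivation, a quick numerical sanity check of the claim (a sketch, nothing more):&lt;/p>

```python
import math

# Compare a large partial sum of 1/n**2 against pi**2/6.
s = sum(1 / n ** 2 for n in range(1, 100001))
print(s)                 # 1.6449240...
print(math.pi ** 2 / 6)  # 1.6449340...
```

The leftover gap is roughly the size of the tail, about $1/100000$, which is consistent with the theorem.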
&lt;p>Euler started off by considering the infinite polynomial
&lt;/p>
$$f(x) = 1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \dots + \frac{(-1)^k x^{2k}}{(2k +
1)!} + \dots.$$&lt;p>
For $x \neq 0$,
&lt;/p>
$$f(x) = \frac{xf(x)}{x} = \frac{x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots +
\frac{(-1)^k x^{2k + 1}}{(2k + 1)!} + \dots}{x}.$$&lt;p>
Note that the numerator of $f(x)$ is actually the Taylor expansion of $\sin(x)$, so
we can actually (for $x \neq 0$) write
&lt;/p>
$$f(x) = \frac{\sin(x)}{x}.$$&lt;p>
Provided that we can find the roots of $f(x)$, we can factor it. In our case, $f(x) =
0$ when $\sin(x) = 0$, so the infinitely many roots $f$ are $k\pi$ for all integers
$k$. We can thus factor $f$ as:
&lt;/p>
$$f(x) = (1 - \frac{x}{\pi})(1 + \frac{x}{\pi})(1 - \frac{x}{2\pi})(1 +
\frac{x}{2\pi})\dots.$$&lt;p>(This factorization comes from a theorem in algebra that states that for a polynomial
$p(x)$, if the roots of $p$ are $a_1, a_2,\dots, a_n$ and $p(0) = 1$, then you can
factor $p(x) = (1 - x/a_1)(1 - x/a_2)\dots(1-x/a_n)$.)&lt;/p>
&lt;p>Next, we observe that each pair of factors of the form $(1 - \frac{x}{k\pi})(1 +
\frac{x}{k\pi})$ can be combined into $1 - \frac{x^2}{k^2\pi^2}$, so $f$ now looks
like
&lt;/p>
$$f(x) = (1 - \frac{x^2}{\pi^2})(1 - \frac{x^2}{2^2\pi^2})(1 - \frac{x^2}{3^2\pi^2})\dots$$&lt;p>
Okay… this is good, because $f$ smells of both $\pi$ and the squares of natural numbers…&lt;/p>
&lt;p>To force these pieces into place, Euler then multiplied out our factorization and collected the coefficients of like powers of $x$. In particular, as it pertains to our problem, he only cared about the coefficient on the $x^2$ term of $f$. The only $x^2$ terms in $f$ are those that are produced by multiplying an $x^2$ term with a 1. Adding up these terms, we get the following sum:
&lt;/p>
$$-\frac{x^2}{\pi^2} - \frac{x^2}{4\pi^2} - \frac{x^2}{9\pi^2} - \dots = x^2\biggr[-\frac{1}{\pi^2}\biggr(1 + \frac{1}{4} + \frac{1}{9} +\dots\biggr)\biggr],$$&lt;p>
so the coefficient is
&lt;/p>
$$-\frac{1}{\pi^2}\biggr(1 + \frac{1}{4} + \frac{1}{9} +\dots\biggr).$$&lt;p>From our original representation of $f$ (before we multiplied it by $\frac{x}{x}$),
we know what the coefficient should be, namely $-\frac{1}{3!}$. Equating the
coefficient we found with what we know it should be, we have
&lt;/p>
$$-\frac{1}{3!} = -\frac{1}{\pi^2}\biggr(1 + \frac{1}{4} + \frac{1}{9} +\dots\biggr),$$&lt;p>
which, rearranged a bit, solves the Basel problem because it means that
&lt;/p>
$$\frac{\pi^2}{6} = 1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \dots.$$&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Before I conclude, I want to clarify that Euler’s argument was not entirely rigorous. Much of it extends finite results
to the infinite without justification; most notably, the factorization theorem he invoked applies to (finite) polynomials,
not to infinite series.&lt;/p>
&lt;p>Nonetheless, his solution to the Basel Problem is a great example of the ingenuity
with which Euler attacked many of the open mathematical conundrums of his day.
Despite its reputation as rote and formulaic, mathematics requires a wild
imagination. Euler’s was one of the wildest, and mathematicians who inherited his
legacy could not be more thankful.&lt;/p></description></item><item><title>TSP is inapproximable</title><link>https://www.jgindi.me/posts/2017-04-26-inapprox-tsp/</link><pubDate>Wed, 26 Apr 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-04-26-inapprox-tsp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As an introductory computer science student, I was enamored by the Traveling Salesman
Problem (if you’ve never heard of it, see &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem">Travelling salesman problem -
Wikipedia&lt;/a>). It is very
easy to state and has very simple and important practical applications, yet somehow
my professors were telling me that we don’t, at present, have an efficient
algorithm to solve it. There are heuristics, yes, but you can (easily) design
pathological inputs on which whatever heuristic you design performs horribly.&lt;/p>
&lt;h2 id="approximation-algorithm">Approximation algorithm&lt;/h2>
&lt;p>It’s often the case that when faced with $NP$-hard optimization problems (such as
TSP), instead of heuristics, we try to design &lt;em>approximation algorithms&lt;/em>, or
algorithms that provably produce outputs within some acceptable factor of the
answer. To be a bit more precise about it:&lt;/p>
&lt;ol>
&lt;li>An algorithm $A$ is an $\alpha$-approximation for a maximization problem $P$ if on every instance $I$ of $P$, $A$ produces a solution of size at least $OPT/\alpha$ (where $OPT$ is the size of the optimal solution for the instance $I$).&lt;/li>
&lt;li>$A$ is an $\alpha$-approximation for a minimization problem if $A$ produces a solution of size at most $\alpha OPT$.&lt;/li>
&lt;/ol>
&lt;p>Note that $\alpha \geq 1$.&lt;/p>
&lt;h2 id="inapproximability">Inapproximability&lt;/h2>
&lt;p>There are myriad approximation algorithms out there for a bunch of different
problems; I may make them the subject of a future blog post… but in the remainder of
this post, I want to actually show that for any $\alpha \geq 1$, it is impossible
to come up with an $\alpha$-approximation for TSP unless $P = NP$ (in which
case we wouldn’t need approximations, we would have a polynomial time algorithm to
solve the problem exactly!). In other words, unless $P = NP$, not only do we not
have an efficient algorithm for TSP, we can’t even &lt;em>approximate&lt;/em> it efficiently!
The proof of this fact, which I find surprising and kind of amazing given the
amount of effort and brainpower that have been thrown at this problem over the
years, is pretty simple, so I thought it would be fun to go through it here.&lt;/p>
&lt;p>(Before we do, if you aren’t familiar with it, read &lt;a href="https://en.wikipedia.org/wiki/Hamiltonian_path_problem">Hamiltonian path problem -
Wikipedia&lt;/a>)&lt;/p>
&lt;h2 id="the-proof">The proof&lt;/h2>
&lt;p>We&amp;rsquo;ll start with a sketch of the proof, then move on to the actual proof.&lt;/p>
&lt;p>&lt;strong>Proof idea:&lt;/strong> The proof is by contradiction. We will assume we have a polynomial
time approximation for TSP and use it as a black box to solve a known $NP$-complete
problem, HAM-CYCLE, in polynomial time. Assuming $P \neq NP$, this is impossible, so
such an approximation cannot exist.&lt;/p>
&lt;p>&lt;strong>Proof:&lt;/strong> Suppose that $A$ is a polynomial time $\alpha$-approximation algorithm for TSP. Use $A$ to
construct the following algorithm $A’$ which, on some input graph $G = (V,E)$,
computes a solution (a YES or NO) to HAM-CYCLE:&lt;/p>
&lt;ol>
&lt;li>Create the graph $G’ = (V, E’)$ by completing $G$ (i.e. by adding edges to $E$
until there are edges between every pair of vertices in $V$).&lt;/li>
&lt;li>Give the edges in $E$ weights of 0, and give those in $E’ - E$ weights of 1. (Note
that $G’$ is an instance of TSP.)&lt;/li>
&lt;li>Use $A$ to approximate the least cost tour $T$ in $G’$.&lt;/li>
&lt;li>Output NO if $T$ has weight $> 0$ and YES otherwise.&lt;/li>
&lt;/ol>
&lt;p>We just need to argue that $A$ outputs a tour of weight 0 if and only if there is a
Hamiltonian cycle in $G$. To see this, note that by definition, $A$ finds a tour
whose combined weight is within some factor $\alpha$ of the weight of the optimal tour on $G’$.
If the optimal tour on $G’$ can indeed be made up only of edges from $G$ (that is, if $G$ has a
Hamiltonian cycle), it has weight 0, in which case $A$ would have to return an answer within a
factor $\alpha$ of 0… namely 0. Conversely, if $A$ finds a tour with weight $> 0$, then no
tour using only edges of $G$ exists, in which case we can safely output that there is no
Hamiltonian cycle in $G$. &lt;strong>QED.&lt;/strong>&lt;/p>
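&lt;p>To make the construction concrete, here is a minimal Python sketch of $A’$. The names are mine, and since the whole point is that no polynomial time approximation $A$ exists, an exact brute-force TSP solver stands in for $A$; the reduction itself — completing the graph and weighting the edges — is exactly as described above.&lt;/p>

```python
from itertools import permutations

def ham_cycle_via_tsp(n, edges):
    """Decide HAM-CYCLE on G = (V, E), V = {0..n-1}, by reducing to TSP."""
    edge_set = {frozenset(e) for e in edges}

    def weight(u, v):
        # Edges of G cost 0; the edges we added to complete G cost 1.
        return 0 if frozenset((u, v)) in edge_set else 1

    # Stand-in for the hypothetical approximation algorithm A: exact brute
    # force.  Any factor-alpha answer on an optimum of weight 0 is still 0.
    best = min(
        sum(weight(tour[i], tour[(i + 1) % n]) for i in range(n))
        for tour in permutations(range(n))
    )
    return best == 0  # YES iff a weight-0 tour (a Hamiltonian cycle) exists
```

&lt;p>On a 4-cycle this returns YES; on a star (where no Hamiltonian cycle can exist) any tour must use at least two added edges, so it returns NO.&lt;/p>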
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>And with that, in just a few short words, we were, assuming $P \neq NP$, able to rule
out all possible approximation schemes that anyone could ever think of! Imagine all
the time and effort we’ve saved!&lt;/p></description></item><item><title>Hilbert's hotel</title><link>https://www.jgindi.me/posts/2017-04-05-hotel/</link><pubDate>Wed, 05 Apr 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-04-05-hotel/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>It’s often the case that when people try to reason about infinities, they get lost in
forests of paradoxes. More precisely, they stop being able to intuitively make their
way around the mathematical landscape. You see, infinity isn’t something we deal with
in our daily lives. You could probably argue that we are biologically predisposed to
have trouble with it; &lt;a href="https://scottaaronson.com">Scott Aaronson&lt;/a>
wrote a &lt;a href="https://www.scottaaronson.com/writings/bignumbers.html">great article&lt;/a> that speaks to this.&lt;/p>
&lt;p>A few months ago, while I was TAing a class on ideas in mathematics, I was asked to
give one of the lectures. I decided to make it about the basics of infinity and in
it, I walked through things like what it means for infinite sets to have the same
size, Cantor’s diagonalization argument and some other related items of interest
(right out of some of my earlier blog posts). I then talked about one of the famous
examples of infinity’s mindbending amazingness: Hilbert’s hotel. The students in the
lecture seemed into it, so I told myself after that lecture that I would try to write
something about it here… this is that something.&lt;/p>
&lt;h2 id="a-few-people">A few people&lt;/h2>
&lt;p>I run a strangely constructed hotel. It has one infinitely long hallway with
countably infinite rooms, numbered 1, 2, 3, 4 and on and on and on. There’s just one
problem, though: all of the rooms are occupied and I have a guest at the front desk
who wants a room. How might I accommodate him?&lt;/p>
&lt;p>(Think about this for a minute… what would you do?)&lt;/p>
&lt;p>After much thought and consulting some mathematically oriented consultants, I decided
to have everyone move over by 1 room. That is, the person in room 1 moves to room 2,
the person in room 2 moves to room 3 and so on. Mathematically, the person in room
$n$ moves to room $n + 1$. As is hopefully clear, by doing this, room 1 is now open
and, after I get some of the staff to clean it, ready for my new guest.&lt;/p>
&lt;p>Extending this method further, we can actually accommodate any finite number $k$ of
guests by just having everyone move over $k$ rooms. For example, if $k = 78$, we
would put the guest currently in room 1 in room 79, the person in room 2 into room 80
etc. This would free up the first $k$ rooms and enable us to accommodate the $k$ new
guests.&lt;/p>
&lt;h2 id="a-caravan">A caravan&lt;/h2>
&lt;p>Ok, that wasn’t too bad, and the solution seems plausible enough. A few months later,
however, I encountered, shall we say, a bigger problem. A countably infinite number
of people came to the desk and said they all wanted rooms. Thinking back to how I
accommodated a finite number of guests, I quickly realized that this wasn’t going to
work. I can’t exactly ask a guest in room 1 to move to room $\infty$ and the person
in room 2 to move to room $\infty + 1$; they’d never stop walking down that hallway!
What kind of hospitality would that be?! The only acceptable way to accommodate new
guests is one that assigns each guest who currently has a room a specific new room…
can I do this for an infinite number of guests?&lt;/p>
&lt;p>Turns out I can. After consulting my friends again, we decided that the easiest way
to accomplish this would be to free up all of the odd numbered rooms. Formally, we
would move the guests in room $n$ to the room $2n$. Note that no two guests get
assigned the same room; the guest in room 7 is the only guest who ends up in room 14.
Observe also that after the move, the only occupied rooms are the even numbered rooms
because any room that we moved someone to with our rule is even numbered. Thus all of
the odd rooms are open and we can put our guests into the odd numbered rooms.&lt;/p>
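&lt;p>If you like, you can sanity-check this rule over any finite prefix of the hotel; a quick Python sketch over the first thousand rooms:&lt;/p>

```python
# Move the guest in room n to room 2n, for the first 1000 rooms.
new_rooms = [2 * n for n in range(1, 1001)]

# Every occupied room is now even, so every odd room has been freed up...
assert all(room % 2 == 0 for room in new_rooms)
# ...and no two guests were sent to the same room (n -> 2n is injective).
assert len(set(new_rooms)) == len(new_rooms)
```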
&lt;p>Why did this work mathematically? I’m going to assume some knowledge of the
terminology that follows, but the reason is that the function $f : \mathbb{N} \to
\mathbb{E}$ (where $\mathbb{E}$ is the set of even numbers) given by $f(n) = 2n$ (our
rule) is actually a bijection (or one-to-one correspondence). This means we can
match each natural number with exactly one even number and that every even number
gets hit (think about this for a minute; try to convince yourself of this). In some
sense, the fact that $f$ is bijective tells us that we can fit $\mathbb{N}$ into
$\mathbb{E}$ (which is actually itself a subset of $\mathbb{N}$). Doing this frees
up any rooms whose labels aren’t in $\mathbb{E}$, namely the odd rooms. What we’ve
actually done here is shown that in a mathematically precise way, there are as many
even numbers as there are whole numbers… cool, no?&lt;/p>
&lt;h2 id="caravans-and-caravans">Caravans and caravans&lt;/h2>
&lt;p>At this point, I thought I was good to go. Now that I know how to accommodate an
infinity of guests, what more could I possibly need to know? Turns out, my biggest
challenge of them all was yet to come. A few years after solving the infinite guest
problem, a countably infinite number of caravans each carrying a countably infinite
number of guests showed up and said that all of them wanted rooms. This seemed rather
daunting… how was I going to tackle this one?&lt;/p>
&lt;p>Well, I said, when the going gets tough, the tough get going. So, naturally, I called
my friends again ;) This time, they suggested the following scheme:&lt;/p>
&lt;ol>
&lt;li>Free up the odd rooms (we know how to do this already).&lt;/li>
&lt;li>Label each caravan with the odd primes in ascending order (there are infinitely
many of these, so we’ll have enough for all caravans).&lt;/li>
&lt;li>For each caravan, label the people in it 1, 2, 3, 4 etc.&lt;/li>
&lt;li>Put person $n$ from the caravan labeled $p$ into room $p^n$.&lt;/li>
&lt;/ol>
&lt;p>Ok, seems simple enough. But does it work?&lt;/p>
&lt;p>To show that it does observe the following few simple facts:&lt;/p>
&lt;ul>
&lt;li>If $p$ is an odd prime, then $p^n$ will also be odd, so I will never assign a
guest to an even room.&lt;/li>
&lt;li>A power of $p_1$ will never contain a factor of $p_2$ (e.g. $5^3$ can’t contain
any prime factors except for $5$); this means there is no way that I’ll
accidentally assign two people from different caravans to the same room.&lt;/li>
&lt;li>$p^i \neq p^j$ for $i \neq j$, so we see that with this rule, I won’t assign any
pair of people from the same caravan to the same room.&lt;/li>
&lt;/ul>
&lt;p>Given these three facts, we see that this rule successfully gives all of our guests
rooms. It even leaves a bunch of rooms unoccupied! Can you see which ones?&lt;/p>
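&lt;p>The three facts above are also easy to spot-check for the first few caravans and guests (a small Python sketch, with the caravan labels hardcoded):&lt;/p>

```python
# Caravans are labeled with odd primes; person n in caravan p goes to room p**n.
odd_primes = [3, 5, 7, 11, 13]  # labels for the first few caravans
rooms = [p ** n for p in odd_primes for n in range(1, 30)]

assert all(room % 2 == 1 for room in rooms)  # only freed-up odd rooms are used
assert len(set(rooms)) == len(rooms)         # no two guests share a room
```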
&lt;p>(As a technical aside, we just showed that you can match a set of the same size as $\mathbb{N}
\times \mathbb{N}$ — our caravans — with a subset of $\mathbb{N}$ — the odd
numbers — in a one-to-one correspondence. Without going into too much detail, by
noting that $\mathbb{Q}$, the set of rational numbers, has the same size as $\mathbb{N} \times
\mathbb{N}$, the last leg of our journey is actually a proof of the somewhat
surprising fact that $|\mathbb{Q}| = |\mathbb{N}|$.)&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>It’s been a pleasure taking you through some of my experiences as a hotel manager.
As of late, business is booming and I couldn’t be happier. You can bring as many
friends as you like and I can almost guarantee we’ll be able to make room for all of
you. If you bring uncountably many friends, though, I’ll have to send you to
Cantor’s hotel down the street.&lt;/p></description></item><item><title>The derivative via linear algebra</title><link>https://www.jgindi.me/posts/2017-03-15-linalg-deriv/</link><pubDate>Wed, 15 Mar 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-03-15-linalg-deriv/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As a math major in college, I had, for a long time, been under the impression
that calculus and algebra were totally separate parts of math. The types of problems
you thought about in one of them were totally disjoint from the types of problems you
tackled in the other. Continuous vs. Discrete. Algebraic this vs. Analytic that.
As I was watching (a wonderful) &lt;a href="https://www.youtube.com/watch?v=fNk_zzaMoSs&amp;amp;list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab">video series&lt;/a> on linear algebra
by 3blue1brown, I came across the following really cool connection between
calculus and algebra that
was simple, elegant and clever. But, more importantly, it spectacularly illustrates
the connections that one finds between seemingly separate parts of math.
Let’s take a look at finite polynomials.&lt;/p>
&lt;h2 id="polynomials">Polynomials&lt;/h2>
&lt;p>Polynomials are mathematical objects of the
form $p(x) = a_0 + a_1x + a_2x^2 + \dots + a_nx^n$, where the $a_i$ are scalars drawn
from a field $F$ and $n$ is an arbitrary natural number. It’s easy to check that
polynomials actually make up a vector space over $F$. Formally, this means that:&lt;/p>
&lt;ol>
&lt;li>Polynomials make up a commutative group under addition.
&lt;ol>
&lt;li>There’s an identity element ($p(x) = 0$ is the identity).&lt;/li>
&lt;li>Every element has an inverse (the inverse of $p(x)$ is $-p(x)$).&lt;/li>
&lt;li>Addition is associative ($p(x) +(q(x) + r(x)) = (p(x) + q(x)) + r(x)$)&lt;/li>
&lt;li>Adding two polynomials always produces another polynomial (this property is
called closure).&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>If you multiply a polynomial by a scalar, the result is yet another polynomial.&lt;/li>
&lt;li>Scalar multiplication distributes over polynomial addition (if $p$ and $q$ are
polynomials and $c$ is a scalar, the three must satisfy $c(p + q) = cp + cq$).&lt;/li>
&lt;li>Scalar multiplication distributes over scalar addition (if $c_1$ and $c_2$ are
scalars and $p$ is a polynomial, they must satisfy $(c_1 + c_2)p = c_1p + c_2p$).&lt;/li>
&lt;li>There must be a multiplicative identity for scalar multiplication (the scalar $1$,
since $1 \cdot p(x) = p(x)$).&lt;/li>
&lt;li>If $c_1$ and $c_2$ are scalars and $p$ is a polynomial, then $c_1(c_2p) =
(c_1c_2)p$.&lt;/li>
&lt;/ol>
&lt;p>I’m going to skip verifying these, but if you think about them, they’re mostly (if
not all) sort of intuitive. For the rest of the post, we are just going to assume
that polynomials make up a vector space.&lt;/p>
&lt;h2 id="calculus-detour">Calculus detour&lt;/h2>
&lt;p>Let’s jump over to calculus for a minute. Do you remember how we differentiate a
polynomial? For example, if $p(x) = 3x^2 + x + 7$, what is its derivative, $D(p(x))$?
If we recall our first calculus course, we remember that we were told that we could
differentiate each of $3x^2$, $x$ and $7$ separately and then add the results
together. Furthermore, we have two differentiation rules that will help us
differentiate a single term:&lt;/p>
&lt;ol>
&lt;li>$D(x^n) = nx^{n-1}$.&lt;/li>
&lt;li>$D(cf(x)) = cD(f(x))$ (you can pull out constants).&lt;/li>
&lt;/ol>
&lt;p>With these rules in hand, we see that the derivative of $3x^2$ is $3 \cdot 2x = 6x$,
the derivative of $x$ is 1 and the derivative of 7 (or any other constant, for that
matter) is 0. Adding these together, we conclude that $D(p(x)) = 6x + 1$. Okay,
now reread the calculus we just thought through and keep it in mind; we have to jump
back to linear algebra for a second.&lt;/p>
&lt;h2 id="differentiation-is-a-linear-map">Differentiation is a linear map&lt;/h2>
&lt;p>If I have two vector spaces $V$ and $W$ over a field $F$, then a map $T:V \to W$ is
said to be linear if:&lt;/p>
&lt;ol>
&lt;li>$T(u + v) = Tu + Tv$.&lt;/li>
&lt;li>$T(cu) = cTu$ (for $c \in F$).&lt;/li>
&lt;/ol>
&lt;p>The first rule says that applying a linear transformation to a sum of vectors should
produce the same result as if you applied the transformation to each summand and then
added them in the target space. This looks kind of familiar, doesn’t it? Above, when
we computed $D(p(x))$, we took $p(x)$ apart, applied $D$ to each part, and then put
the results back together… In other words, we said that
&lt;/p>
$$D(p(x)) = D(3x^2) + D(x) + D(7).$$&lt;p>It’s easy to see that this rule, whereby we are allowed to decompose things, work on
them, and put them back together, generally applies to the differentiation of any
polynomial, so we’ve established that the polynomial differentiation operator $D$
satisfies the first property of linear maps!&lt;/p>
&lt;p>Furthermore, if we look at the second differentiation rule that helped us up above,
it is exactly the second property of linear transformations! (Just replace $T$ with
$D$ and $u$ with some polynomial $p(x)$.)&lt;/p>
&lt;p>We thus see that the operator $D$, which takes the derivative of a polynomial, is
linear!&lt;/p>
&lt;p>To sum up what we’ve said so far:&lt;/p>
&lt;ul>
&lt;li>The space of polynomials is a vector space (we will henceforth call $P$).&lt;/li>
&lt;li>The differentiation operator, $D$, is a linear transformation from $P$ to itself
(because differentiating a polynomial always gives another polynomial).&lt;/li>
&lt;/ul>
&lt;p>Thus, once we produce a convenient basis for $P$, we can actually write down a
matrix that will do differentiation of polynomials for us! But what basis should we
use?&lt;/p>
&lt;p>Because polynomials in $P$ can have arbitrarily large degree, our basis will
have to be infinite. The basis we choose is inherent in the general
structure of polynomials. Can you see what it might be?
Because polynomials of degree $n$ are just linear combinations of the infinite list
$\{1, x, x^2, \dots\}$ (e.g. $3x^2 + 4x + 3$ can be seen as $3 \cdot 1 + 4 \cdot x +
3 \cdot x^2 + 0 \cdot x^3 + 0 \cdot x^4 +\dots$), we will call this set our
basis (verify span and linear independence!) and now use it to write down a(n
infinite) matrix corresponding to $D$.&lt;/p>
&lt;p>Note that the $i$th column of a matrix describes what the transformation does to the
$i$th basis vector of our space. So, in order to write down the first column of
$D$’s matrix, we need to know what $D(1)$ is written as in terms of $P$’s basis
vectors. Well, if $D(1) = 0 = 0 \cdot 1 + 0 \cdot x + \dots$, then the first column
of our matrix must be
&lt;/p>
$$\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}.$$&lt;p>To determine the next column, we look at what $D$ does to our second basis vector,
$x$. $D(x) = 1 = 1 \cdot 1 + 0 \cdot x + \dots$, so the second column of our matrix
would look like
&lt;/p>
$$\begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \end{bmatrix}.$$&lt;p>
The last basis vector we need to look at before we can intuit the rest of the
columns is $x^2$. $D(x^2) = 2x = 0 \cdot 1 + 2 \cdot x + 0 \cdot x^2 + \dots$, so
the third column is
&lt;/p>
$$\begin{bmatrix} 0 \\ 2 \\ 0 \\ \vdots \end{bmatrix}.$$&lt;p>
You could probably guess the next column, and the one after that, and, most
probably, all of the ones after that… we finally have this matrix for $D$ (note that
it’s infinite):
&lt;/p>
$$A = \begin{bmatrix} 0 &amp; 1 &amp; 0 &amp; 0 &amp; \dots \\ 0 &amp; 0 &amp; 2 &amp; 0 &amp; \dots \\ 0 &amp; 0 &amp; 0 &amp; 3
&amp; \dots \\ \vdots &amp; \vdots &amp; \vdots &amp; \vdots\end{bmatrix}.$$&lt;p>
If you represent a polynomial as a(n infinitely long) vector of its coefficients,
then you can actually do differentiation with this matrix. For example, if your
polynomial was $p(x) = 4x^3 + 5x^2 + 29x + 9$, you would perform
&lt;/p>
$$A\begin{bmatrix} 9 \\ 29 \\ 5 \\ 4 \\ 0 \\ \vdots \end{bmatrix} = \begin{bmatrix}
29 \\ 10 \\ 12 \\ 0 \\ 0 \\ \vdots \end{bmatrix},$$&lt;p>
i.e. your derivative is $12x^2 + 10x + 29$, which, using the rules you learned in
calculus class, is demonstrably correct.&lt;/p>
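&lt;p>A truncated version of this picture is easy to play with in code. Here is a small Python sketch (the function names are mine) that builds the first $m \times m$ block of $A$ in the basis $\{1, x, x^2, \dots\}$ and applies it to a coefficient vector:&lt;/p>

```python
def diff_matrix(m):
    """Truncated m x m matrix of the differentiation operator D in the basis
    {1, x, x^2, ...}: column j holds the coordinates of D(x^j) = j * x^(j-1)."""
    A = [[0] * m for _ in range(m)]
    for j in range(1, m):
        A[j - 1][j] = j
    return A

def differentiate(coeffs):
    """Multiply the coefficient vector (constant term first) by the matrix A."""
    m = len(coeffs)
    A = diff_matrix(m)
    return [sum(A[i][j] * coeffs[j] for j in range(m)) for i in range(m)]

# p(x) = 4x^3 + 5x^2 + 29x + 9  ->  p'(x) = 12x^2 + 10x + 29
print(differentiate([9, 29, 5, 4]))  # [29, 10, 12, 0]
```

&lt;p>Matrix–vector multiplication really does reproduce the power rule.&lt;/p>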
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>To sum up, we’ve reconceived the space of polynomials as a vector space and used
notions from both linear algebra and calculus to come up with a pretty nice looking
matrix that doesn’t intuitively look like differentiation, but that somehow
perfectly describes it when you look at it through the right lens. There are
connections like these all over mathematics; you just have to know where to look.&lt;/p></description></item><item><title>Randomized algorithm for file comparison</title><link>https://www.jgindi.me/posts/2017-01-25-file-comp/</link><pubDate>Sat, 25 Feb 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-01-25-file-comp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The next problem we use randomization to solve might seem a bit closer to one we
might face in reality. It goes like this. Alice and Bob each have copies of the same
file that they need to keep synchronized (call Alice&amp;rsquo;s file $A$ and Bob&amp;rsquo;s $B$). Over
time, however, it&amp;rsquo;s possible that Alice&amp;rsquo;s and Bob&amp;rsquo;s files get out of sync. Our
task today is to come up with a protocol by which Alice and Bob can check that $A =
B$ without one having to send the other his/her entire file.&lt;/p>
&lt;h2 id="the-algorithm">The algorithm&lt;/h2>
&lt;p>In order to throw math at the problem the way we want to, we need to make it a bit
more abstract. To this end, we stipulate that $A$ and $B$ are represented by $n$-bit
strings. The comparison protocol works as follows:&lt;/p>
&lt;ul>
&lt;li>Alice picks a prime $p \in \{2..n^2\lg n\}$ (fear not; we will explain the choice
of this range soon). Because $A$ is an $n$-bit string, we can look at it as an $n$-
bit binary integer. Alice computes $A \pmod p$ and sends Bob the prime $p$ and $A
\pmod p$. (Note: computing $A \pmod p$ means the remainder of $A$ left over when we
divide it by $p$.)&lt;/li>
&lt;li>Bob computes $B \pmod p$. If $A = B \pmod p$ ($A$ and $B$ leave the same $p$-
remainder), Bob outputs that the files are the same. Otherwise, he should conclude
that the files he and Alice have are out of sync.&lt;/li>
&lt;/ul>
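&lt;p>The two steps above can be sketched in a few lines of Python. The naive prime enumeration is just for illustration (it is not part of the protocol, and one would generate $p$ more cleverly in practice):&lt;/p>

```python
import math
import random

def primes_up_to(limit):
    """Naive sieve of Eratosthenes; fine for the small bound used here."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for m in range(2, math.isqrt(limit) + 1):
        if sieve[m]:
            for k in range(m * m, limit + 1, m):
                sieve[k] = False
    return [m for m in range(2, limit + 1) if sieve[m]]

def files_probably_equal(A: bytes, B: bytes) -> bool:
    """One round of the protocol: Alice picks a random prime p in {2..n^2 lg n}
    and sends Bob (p, A mod p); Bob compares it with B mod p.  An answer of
    "different" is always correct; "same" errs with small probability."""
    n = 8 * len(A)  # file length in bits
    bound = max(3, int(n * n * math.log2(n)))
    p = random.choice(primes_up_to(bound))
    return int.from_bytes(A, "big") % p == int.from_bytes(B, "big") % p
```

&lt;p>Viewing the files as big integers makes $A \pmod p$ a one-line fingerprint.&lt;/p>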
&lt;p>The question we need to answer is now: How confident can we be that the files are
indeed the same when Bob says &amp;ldquo;same&amp;rdquo;? In other words, we want to know what the
probability is that the algorithm errs.&lt;/p>
&lt;h2 id="analysis">Analysis&lt;/h2>
&lt;p>There are two cases to analyze here. If the files were the same to begin with, the
algorithm will never fail. Mathematically, if $A = B$, then $A = B \pmod p$ so Bob
will always output &amp;ldquo;same&amp;rdquo; in this case.&lt;/p>
&lt;p>The interesting case to analyze is the case in which $A \neq B$. In this case, we are
interested in
&lt;/p>
$$\Pr[A = B \pmod p ~|~ A \neq B].$$&lt;p>
In English, we are interested in the probability that Bob outputs &amp;ldquo;same&amp;rdquo; even when
his and Alice&amp;rsquo;s files are not in sync. To analyze this probability, we need to
entertain a quick tangent.&lt;/p>
&lt;p>In particular, we need to motivate our preference that $p \in \{2..n^2\lg n\}$.
There is a neat theorem in number theory called the Prime Number Theorem which states
that in the range $\{2..N\}$, there are about $\frac{N}{\lg N}$ primes. We can
show, using the theorem and some algebra that we need not bother ourselves with here,
that there are about $n^2$ primes in the range from which we drew $p$ (start by
substituting $N = n^2 \lg n$ in the theorem).&lt;/p>
&lt;p>Keep this fact in mind. Let $C = |A - B|$. We can reformulate the probability of the
protocol failing as $\Pr[C = 0 \pmod p ~|~ C \neq 0]$.&lt;/p>
&lt;p>Next, we note that because $A$ and $B$ both have $n$ bits, $C$ too has at most $n$
bits. This means that $1 \leq C \leq 2^n$ (we know $C \geq 1$ because $A \neq B$). Since
every prime factor of $C$ is at least 2, a nice feature of $C$ that we can observe is
that $C$ has at most $n$ prime divisors. This is the key fact. Because we had about
$n^2$ primes from which to choose $p$ and there are (at most) $n$ bad choices among them
(a choice is bad if $p$ divides $C = |A - B|$, in which case $A = B \pmod p$ even though
$A \neq B$), the probability that $p$ is a &amp;ldquo;bad prime&amp;rdquo; is at most
$\frac{n}{n^2} = \frac{1}{n}$.&lt;/p>
&lt;p>The probability of success is thus at least $1 - 1/n$, which is great because,
intuitively, it means that as our strings get larger, the odds of the algorithm
failing get very, very small.&lt;/p>
&lt;h2 id="space-complexity">Space complexity&lt;/h2>
&lt;p>The last minor thing we note is that the number of bits required for this protocol is
only the number of bits required to represent $p$ and $A \pmod p$ (what Alice sends
to Bob). $A \pmod p$ is smaller than $p$, so we can write an upper bound on the
total number of bits required as $2\lg p$ bits. Because $p$ is at most $n^2 \lg n$,
we require $2 \lg(n^2\lg n) = \lg (n^4(\lg n)^2) = 4 \lg n + 2\lg(\lg n) = O(\lg n)$ bits to, with high probability, successfully compare the files. Our goal was
to share a sublinear number of bits relative to the size of the file, so the
protocol above achieves the desired aim.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>You&amp;rsquo;ll note that if you go back and look at the matrix multiplication post I put up a
little while ago, the analysis here and there are very similar. In each case, we had
a problem in which we needed to compare two objects without comparing the entire
objects to one another. In this case, the objects were strings; in the matrix case,
the objects were matrices. In both cases, we devised schemes wherein we mapped the
larger objects to smaller ones that were much easier to compare; although the
representations of the smaller objects are lossy, we choose our mapping carefully so
that the information we sacrifice only introduces a small probability of failure. The
technique described above is called fingerprinting, and it is a very powerful tool
used in the study and design of randomized algorithms.&lt;/p></description></item><item><title>Randomized matrix multiplication checking</title><link>https://www.jgindi.me/posts/2017-02-25-check-matmul/</link><pubDate>Sat, 25 Feb 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-02-25-check-matmul/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Warning: This post is pretty technical. It details a part of a lecture from a class
I&amp;rsquo;m in this semester. It requires some mathematical maturity. I&amp;rsquo;ll do my best to make
it as accessible as possible, but how effective I&amp;rsquo;ll be at that remains a mystery&amp;hellip;
Here we go!&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>The simplest algorithm for matrix multiplication takes $O(n^3)$ time. As of right now, the
fastest algorithm we have for matrix multiplication runs in $O(n^{2.373})$ time. This
algorithm is very complicated and I won&amp;rsquo;t go into detail here (mostly because I don&amp;rsquo;t
know the details), but should we devise an algorithm that is even faster than that, we might
want an efficient algorithm to check that it computes the correct result. A little
bit of thought shows us that the best conceivable complexity for matrix
multiplication is $O(n^2)$ &amp;mdash; that is the amount of time it would take us to write
out the result even if no additional computation was necessary. This post details a
randomized algorithm that checks the correctness of matrix multiplication in $O(n^2)$
time. Now, to the formal problem statement&amp;hellip;&lt;/p>
&lt;p>Suppose you had three $n \times n$ matrices $A, B$, and $C$ and I wanted to know
whether $AB = C$.&lt;/p>
&lt;h2 id="first-approach">First approach&lt;/h2>
&lt;p>A first approach is to choose a random entry $C_{ij}$ and check that it was computed
correctly by checking that it equals the dot product of row $i$ of $A$ and column $j$
of $B$. Each such check takes linear time, so we can do up to $n$ of them and still
stay at our desired runtime of $O(n^2)$. This seems clever enough&amp;hellip; I mean, $n$
entries is pretty good, right? If all of those entries were computed correctly, we
can be reasonably sure that $AB = C$&amp;hellip; right?&lt;/p>
&lt;h2 id="trying-again">Trying again&lt;/h2>
&lt;p>Nope! The problem with this approach can be better understood by asking the following
question: If there is some entry of $C$ that was computed incorrectly, what are the
odds that the above algorithm would catch it? Well, if there are $n^2$ entries in $C$
and we choose $n$ of them to check, provided that we pick them uniformly at random,
we have a $\frac{n}{n^2} = \frac{1}{n}$ chance of catching the entry that was
computed incorrectly. Because asymptotically, $n^2$ runs away from $n$, as our
matrices get bigger, our odds of catching a mistake shrink. That isn&amp;rsquo;t good! How
else might we go about this?&amp;hellip;&lt;/p>
&lt;p>Consider the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Generate an $n$-bit vector $r$ where each component of the vector is selected
uniformly at random from $\{0,1\}$.&lt;/li>
&lt;li>Compute $ABr$ and $Cr$.&lt;/li>
&lt;li>If $ABr = Cr$, return true, else return false.&lt;/li>
&lt;/ol>
&lt;p>Note that step 1 takes $O(n)$ time. Because $ABr = A(Br)$ and $Br$ is just a vector
of length $n$, step 2 takes $O(n^2 + n^2) = O(n^2)$ time. The last step takes linear
time, so, in total, the above approach takes $O(n^2)$ time &amp;mdash; just what we need. But
how does it compare to the alternative at finding mistakes?&lt;/p>
&lt;h2 id="does-it-work">Does it work?&lt;/h2>
&lt;p>There are two cases to consider here:&lt;/p>
&lt;ol>
&lt;li>$AB = C$&lt;/li>
&lt;li>$AB \neq C$&lt;/li>
&lt;/ol>
&lt;p>In the first case, there aren&amp;rsquo;t any mistakes to catch, so our analysis need only
consider the second case. Define $D = C - AB$, and note that $ABr = Cr \iff Dr = 0$,
while $AB \neq C \iff D \neq 0$. What we are formally interested in, then, is
$\Pr[Dr = 0 ~|~ D \neq 0]$. To
compute this probability, suppose there was indeed an entry of $C$ that was computed
incorrectly. Without loss of generality, assume that $D_{11} \neq 0$. Note that this
means that $(AB)_{11}$ and $C_{11}$ are different, so $D_{11}$ is of interest to us &amp;mdash;
it is the elusive mistake. To better understand what&amp;rsquo;s going on here, consider the
following:
&lt;/p>
$$\begin{bmatrix}D_{11} &amp; \dots &amp; D_{1n}\\\vdots &amp; \ddots &amp;\\&amp;
&amp;\end{bmatrix}\begin{bmatrix}r_1\\ r_2\\ \vdots\end{bmatrix}=\begin{bmatrix}D_{11}r_1 + \dots + D_{1n}r_n\\ \vdots\end{bmatrix}$$&lt;p>
(computation of $Dr$).&lt;/p>
&lt;p>The only way the algorithm can be fooled is if, somehow, $D_{11}r_1 + \dots +
D_{1n}r_n = 0$. That is, if the first entry of $Dr$ is $0$, we find ourselves in a
potential case wherein $Dr = 0$ even though $D \neq 0$ &amp;mdash; in English, we find
ourselves in a case, when $ABr = Cr$ even though $AB \neq C$. What is the probability
of this happening? Some thought suggests that we are looking for the probability that
$D_{11}r_1 + \dots +D_{1n}r_n = 0$, or, equivalently, the odds that $r_1 = -
\frac{D_{12}r_2 + \dots + D_{1n}r_n}{D_{11}}$. Recall that $r_1 \in \{0,1\}$. If that
ugly fraction is neither 0 nor 1, we&amp;rsquo;re good to go because $r_1$ cannot possibly take
on that value. If, however, that fraction does equal 0 or 1, then there is a
$\frac{1}{2}$ chance that we assigned $r_1$ that value. Thus, $\Pr[Dr = 0 ~|~ D \neq
0] \leq \frac{1}{2}$, which means that our algorithm will catch a mistake at least
half of the time.&lt;/p>
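&lt;p>The whole check fits in a few lines of Python. The post analyzed a single round; the sketch below (a minimal one, with a `trials` knob that is my addition) simply repeats it, which drives the error probability down geometrically:&lt;/p>

```python
import random

def check_matmul(A, B, C, trials=20):
    """Check whether AB == C in O(trials * n^2) time.  Each round multiplies
    by a random 0/1 vector r and compares A(Br) with Cr; a wrong C survives a
    round with probability at most 1/2, so the overall error probability
    after `trials` rounds is at most 2**-trials."""
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # definitely AB != C
    return True  # AB == C with high probability
```

&lt;p>Note that the code never forms $AB$ itself; it only ever multiplies a matrix by a vector.&lt;/p>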
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Isn&amp;rsquo;t math cool?!?! In lecture, this was used to show that
randomization is a powerful tool that allows us to do all kinds of mathematically
rigorous magic. I was blown away; I hope you were too.&lt;/p></description></item></channel></rss>