<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Math on Jack Gindi</title><link>https://www.jgindi.me/tags/math/</link><description>Recent content in Math on Jack Gindi</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 17 Jul 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://www.jgindi.me/tags/math/index.xml" rel="self" type="application/rss+xml"/><item><title>Existence vs construction</title><link>https://www.jgindi.me/posts/2024-09-02-exist-vs-construct/</link><pubDate>Wed, 17 Jul 2024 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2024-09-02-exist-vs-construct/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In everyday life, we show that things exist by producing an example. In order to believe that black sheep exist, I would need someone to show me one. To believe the claim that pigs can fly, I would need to actually observe a flying pig. If you can show me the thing, then I will believe it exists.&lt;/p>
&lt;p>In mathematics, existence is a more abstract concept. Under certain assumptions about logic, a mathematical object can be proven to exist without ever being explicitly constructed or observed. For instance, we can show that a function $f$ has a root $x_0$ (this means that $f(x_0) = 0$) without knowing the concrete value for $x_0$. This can feel counterintuitive, but in the mathematical world, these two concepts are actually separate, with the task of construction often significantly more difficult than the task of existence.&lt;/p>
&lt;p>In this post, we will take a high-level tour through a few examples of problems for which existence is much simpler than construction. Departing from our everyday intuition that existence and construction go hand in hand requires creativity and willingness to think outside the box. Without further ado, let&amp;rsquo;s jump in!&lt;/p>
&lt;h1 id="testing-for-convergence-vs-finding-the-sum">Testing for Convergence vs. Finding the Sum&lt;/h1>
&lt;p>One area in which we can readily observe the difference in difficulty between showing that something exists and constructing it concerns infinite sums. Typically, when we&amp;rsquo;re dealing with infinite sums, there are two questions we care about:&lt;/p>
&lt;ol>
&lt;li>Is the sum finite?&lt;/li>
&lt;li>What is the sum? (This question only really makes sense if the answer to (1) is yes.)&lt;/li>
&lt;/ol>
&lt;p>The Basel Problem, posed in 1650 by Pietro Mengoli and solved by Euler in 1734, asks for the sum of the infinite series $S = \sum_{n = 1}^\infty \frac{1}{n^2}$. Showing that $S$ is finite &amp;ndash; i.e., that a finite sum exists &amp;ndash; simply requires observing that $n$&amp;rsquo;s exponent in the denominator (2) is greater than 1. Finding the sum, however, requires a more elaborate argument which we won&amp;rsquo;t spend time on here. Spoiler alert: $S = \pi^2 / 6$! What the irrational, transcendental $\pi$ has to do with this very integer-y sum is not obvious, but it&amp;rsquo;s another story for &lt;a href="https://www.jgindi.me/posts/euler-basel/">another post&lt;/a>.&lt;/p>
&lt;p>Establishing the fact that $S$ is finite just required a simple test. Producing the actual sum, however, took almost 85 years! This is weird! Next, we&amp;rsquo;ll look at an example from number theory where the difficulty of construction is the foundation upon which cybersecurity rests.&lt;/p>
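&lt;p>To see the convergence numerically (this is just a sanity check, not part of Euler&amp;rsquo;s argument), here is a quick Python sketch comparing a partial sum of the series against $\pi^2/6$:&lt;/p>

```python
import math

# Partial sum of the Basel series: 1/1^2 + 1/2^2 + ... + 1/N^2
N = 100_000
S_partial = sum(1.0 / n**2 for n in range(1, N + 1))

# The omitted tail beyond N is roughly 1/N, so the partial sum should
# land within about 1e-5 of Euler's answer, pi^2 / 6.
assert abs(S_partial - math.pi**2 / 6) < 1e-4
```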
&lt;h1 id="primality-testing-vs-integer-factorization">Primality Testing vs. Integer Factorization&lt;/h1>
&lt;p>One of the most deeply studied objects in number theory &amp;ndash; the study of the properties of the positive whole numbers &amp;ndash; is the prime numbers. A number $p > 1$ is prime if its only divisors are $p$ and $1$. So how do we tell if a number is prime?&lt;/p>
&lt;p>The simplest algorithm is to enumerate the numbers between $2$ and $p - 1$. If one of them divides $p$ evenly, the number is not prime. A slightly better way to do it is based on the fact that if $p = ab$, then unless $p$ is a perfect square, either $a &lt; \sqrt{p}$ and $b > \sqrt{p}$ or vice versa. Thus, instead of searching all the way up to $p$, we can search up to $\sqrt{p}$. But we&amp;rsquo;re not done! Another observation we can make is that since, at bottom, all numbers are products of primes, we only need to search the &lt;em>primes&lt;/em> up to $\sqrt{p}$.&lt;/p>
&lt;p>Much more (very sophisticated) work has been done improving these methods (and some of these improvements may at some point be the subject of another post), but the important thing to say here is that we now have algorithms that run relatively quickly for establishing the primality of a natural number. These efficient methods are often unlike the methods I mentioned earlier, since they do not rely on &amp;ldquo;trial divisions&amp;rdquo; (where we check different candidate divisors by dividing $p$ by them).&lt;/p>
&lt;p>Suppose now that instead of deciding if a number is prime, we change the problem statement slightly to: Find the prime factors of $p$. While it seems like being able to efficiently answer the &amp;ldquo;Is it prime?&amp;rdquo; question should require finding the factors along the way, once again, this turns out not to be the case. As of now, there are no published &amp;ldquo;efficient&amp;rdquo; integer factorization algorithms, though there are algorithms that are &amp;ldquo;almost efficient&amp;rdquo;. (Here, efficiency is a measure of how long it takes to factor an integer $n$ as a function of $n$&amp;rsquo;s size.)&lt;/p>
&lt;p>Many of the cryptographic protocols responsible for securing data on the internet rely on the (likely) computational hardness of factoring numbers efficiently. If an efficient algorithm were to be found, the internet could become a far less safe place to entrust with our credit card information.&lt;/p>
&lt;h1 id="the-probabilistic-method">The Probabilistic Method&lt;/h1>
&lt;p>The probabilistic method, often used when studying finite structures, is a very interesting and general technique that manages to cleverly prove that an object with certain (rare) properties exists without actually finding one. The method does this by using randomness in the following ingenious way. First, we construct a &amp;ldquo;random&amp;rdquo; instance of the object. Then, we show that the object has the desired property with nonzero probability. Since, if we sample an object at random, the &amp;ldquo;probability&amp;rdquo; of drawing the object with the properties we care about is
&lt;/p>
$$
\frac{\text{\# of configurations with property}}{\text{\# of configurations}},
$$&lt;p>
if this probability is nonzero, it means there must be at least one instance that has our property. Rather than carrying out a full proof here, I&amp;rsquo;ll set up a problem to give a &amp;ldquo;concrete&amp;rdquo; sense of the kind of question this method can answer.&lt;/p>
&lt;p>A graph is a collection of nodes connected to one another by edges. A &lt;em>complete&lt;/em> graph is a graph where every node is connected to all of the other nodes. If such a graph has $n$ nodes, then it has $n(n - 1) / 2$ edges. Examples of complete graphs on $n = 2, ..., 7$ vertices are shown below. We typically refer to the complete graph on $n$ vertices as $K_n$.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/exist-vs-construct/complete.png" alt="drawing" width="550" height="400"/>
&lt;p>The question we care about here is: Given a complete graph with $n$ vertices ($K_n$) and an integer $r$ (where $r &lt; n$), is it possible to color each edge either red or blue in such a way that no group of $r$ vertices ($K_r$) has monochromatic connecting edges?&lt;/p>
&lt;p>(Read that again if you need to.)&lt;/p>
&lt;p>The proof begins by constructing a &amp;ldquo;random&amp;rdquo; graph where each edge is colored red or blue independently at random. Using the fact that the graph is random (in the edge-coloring sense) and skipping over some details, it turns out that if for a particular choice of $n$ and $r$,
&lt;/p>
$$
\frac{n!}{r!(n-r)!} 2^{1 - r(r - 1) / 2} &lt; 1,
$$&lt;p>
then such an edge-coloring exists. In other words, given our initial question with some specific values of $n$ and $r$, you can determine if such an edge-coloring exists just by plugging those values into the left side of the inequality and seeing whether the result is smaller than 1. Note, however, that this plug-and-chug way of answering the question gives us &lt;em>no information&lt;/em> about &lt;em>how&lt;/em> to color the edges to see the actual coloring!&lt;/p>
&lt;p>Here, again, we have some clever way to answer the existence question while still being at square one in terms of how we might go about construction.&lt;/p>
&lt;p>For this and other combinatorial problems, one can imagine sitting down and drawing some small examples to try to gain an intuition for what a valid coloring might look like. Maybe you draw a few examples for $n=5$ vertices (ten edges) or $n = 10$ vertices (45 edges). Then you pick a few values of $r$ and realize (if you&amp;rsquo;re lucky and persistent) that a few values of $r$ work while others don&amp;rsquo;t, and you start to feel like getting intuition for this problem from examples might be more difficult than you thought. The probabilistic method allows us to sidestep the problem of construction and instead prove existence in a way that highlights the surprising power of abstract reasoning in tackling what are otherwise mind-crushingly complex combinatorial problems.&lt;/p></description></item><item><title>Is addition commutative?</title><link>https://www.jgindi.me/posts/2023-07-14-series-rearrangement/</link><pubDate>Fri, 14 Jul 2023 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2023-07-14-series-rearrangement/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>This will be a quick one. From the very beginning of our mathematical educations, there is a
fundamental fact of which we are made aware: If $a$ and $b$ are numbers of any
kind (natural numbers, integers, rationals, reals, complex), then &amp;ndash; now hold on to your hats &amp;ndash;
&lt;/p>
$$
\begin{align*}
a + b = b + a.
\end{align*}
$$&lt;p>Earth-shattering &amp;ndash; I know. As some of you are likely already aware, I find infinity fascinating. In this post,
I want to briefly discuss how infinity can mess with some of our most basic assumptions
about the nature of one of our most basic arithmetic operations. Without further ado, let&amp;rsquo;s dive
in.&lt;/p>
&lt;h1 id="infinite-series">Infinite series&lt;/h1>
&lt;p>In our everyday lives (unless you&amp;rsquo;re a mathematician), we only ever consider what it means to carry
out arithmetic operations on finite collections of numbers. A first course in calculus challenges us
to think about the nature of infinite sums, objects like:
&lt;/p>
$$
\begin{align*}
1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \dots
\end{align*}
$$&lt;p>
or
&lt;/p>
$$
\begin{align*}
1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots
\end{align*}
$$&lt;p>But what &amp;ndash; you might ask &amp;ndash; is so interesting about these sums? Don&amp;rsquo;t an infinite number of numbers
added together have to add up to $\infty$?&lt;/p>
&lt;p>(Mathematical aside &amp;ndash; feel free to skip:
Before answering that question, it is worth considering what mathematicians even mean when they ask
about the sum of an infinite number of terms. We could never actually add up infinitely many terms,
so instead, the &amp;ldquo;sum&amp;rdquo; of an infinite series is the limiting value of the sequence of the series'
partial sums. That is, for the second series, we want to know the limiting value of the sequence
$1, 1 + \frac{1}{2}, 1 + \frac{1}{2} + \frac{1}{4}, \dots$.)&lt;/p>
&lt;p>It turns out that some infinite series have finite sums (like the second one) and others do not
(like the first one). There are lots of rules and tests that one can perform to determine
what kind of series one is looking at, but for our purposes, it suffices to know that we call
the ones with finite sums &lt;strong>convergent&lt;/strong> and the ones whose sums are infinite &lt;strong>divergent&lt;/strong>.&lt;/p>
&lt;h1 id="the-harmonic-series-diverges">The harmonic series diverges&lt;/h1>
&lt;p>The first series we looked at earlier is so famous that it has a name: the Harmonic Series. It
is an example of a series whose underlying sequence has terms that get smaller and smaller, but
which, despite that fact, diverges. For full effect, let&amp;rsquo;s briefly have a look at why it diverges.&lt;/p>
&lt;p>Let $H$ denote the sum of the Harmonic Series, and let&amp;rsquo;s compare $H$ to the sum of another,
similar-looking series $H'$:
&lt;/p>
$$
\begin{align*}
H &amp;= 1 + \frac{1}{2} + \biggr(\frac{1}{3} + \frac{1}{4}\biggr) + \biggr(\frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8}\biggr) + \dots \\
H' &amp;= 1 + \frac{1}{2} + \biggr(\frac{1}{4} + \frac{1}{4}\biggr) + \biggr(\frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8}\biggr) + \dots
\end{align*}
$$&lt;p>
Notice that (1) each of the parenthesized groups of terms in $H'$ sums to $1/2$, and (2) each of the corresponding groups of terms
in $H$ sums to a number that is &lt;em>greater&lt;/em> than $1/2$. One of the rules one learns in calculus is (informally) that if you are
comparing two sums and the smaller of them is infinite, the larger one must be infinite too. Since $H > H'$ because of (2), and
$H' = \infty$ because, by (1), it is a sum of infinitely many $1/2$s, $H$ must be infinite too.&lt;/p>
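&lt;p>The grouping argument says, in effect, that the partial sum of the first $2^k$ terms is at least $1 + k/2$, so the partial sums grow without bound. A quick numerical check of that bound:&lt;/p>

```python
def harmonic_partial(n):
    """Partial sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

# The grouping argument gives H_{2^k} >= 1 + k/2 for every k.
for k in range(1, 16):
    assert harmonic_partial(2**k) >= 1 + k / 2
```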
&lt;h1 id="can-we-make-it-converge">Can we make it converge?&lt;/h1>
&lt;p>Even though $H$ diverges, we can make it converge by changing half of the additions into subtractions like so:
&lt;/p>
$$
\begin{align*}
1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} + \frac{1}{7} - \frac{1}{8} + \dots \\
\end{align*}
$$&lt;p>
It turns out that this modified series converges! (We will not spend time proving that this series converges, but if you&amp;rsquo;re interested, I wrote another post a while ago covering that; you can check it out &lt;a href="https://www.jgindi.me/posts/alternating/">here&lt;/a>.)
To make things even weirder, if you group the terms with even denominators together, the sum of their absolute values diverges. (The same applies to the group of terms with odd denominators, as you might expect.)&lt;/p>
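&lt;p>As a quick numerical check of convergence (this series famously sums to $\ln 2 \approx 0.693$), here is a short sketch:&lt;/p>

```python
import math

def alternating_harmonic(n):
    """Partial sum 1 - 1/2 + 1/3 - 1/4 + ... with n terms."""
    return sum((-1) ** (k + 1) / k for k in range(1, n + 1))

# For an alternating series with shrinking terms, the error is at most
# the size of the first omitted term, so this lands within 1e-5 of ln 2.
approx = alternating_harmonic(1_000_000)
assert abs(approx - math.log(2)) < 1e-5
```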
&lt;p>For visual intuition that this series converges (in lieu of a proof), this image should help
(&lt;a href="https://xaktly.com/AlternatingSeries.html">source&lt;/a>):
&lt;img src="https://www.jgindi.me/posts/series-rearrangement/alternating.png" width="500"/>&lt;/p>
&lt;p>The image shows that it converges to approximately 0.694. I claim that I can make it converge to whatever value
I want&amp;hellip;&lt;/p>
&lt;p>But how?&lt;/p>
&lt;h1 id="other-arrangements">Other arrangements&lt;/h1>
&lt;p>Let&amp;rsquo;s say that I wanted the sum of the terms of the alternating harmonic series to be 2.
Carry out the following algorithm:&lt;/p>
&lt;ol>
&lt;li>Add positive terms together until we exceed 2 (we would start with $1 + 1/3 + \dots + 1/15 \approx 2.02$)&lt;/li>
&lt;li>Add negative terms to the output of step 1 until we are below 2 (we would then add $-1/2$, so we&amp;rsquo;d end up at around $1.72$)&lt;/li>
&lt;li>Add positive terms to the output of step 2 until we exceed 2 (starting with $1/17$, and so on)&lt;/li>
&lt;li>Add negative terms until we fall below 2&lt;/li>
&lt;li>etc.&lt;/li>
&lt;/ol>
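&lt;p>The steps above can be sketched directly in code (the function and variable names are my own):&lt;/p>

```python
def rearranged_sum(target, n_terms):
    """Rearrange the alternating harmonic series so its sum approaches target.

    Greedily add unused positive terms (1, 1/3, 1/5, ...) while at or below
    the target, and unused negative terms (-1/2, -1/4, ...) while above it.
    """
    total = 0.0
    next_odd, next_even = 1, 2  # next unused positive / negative denominators
    for _ in range(n_terms):
        if total <= target:
            total += 1.0 / next_odd
            next_odd += 2
        else:
            total -= 1.0 / next_even
            next_even += 2
    return total

# The oscillations around the target shrink as terms are consumed.
assert abs(rearranged_sum(2.0, 200_000) - 2.0) < 0.01
```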
&lt;p>Why will this sequence of numbers (2.02, 1.72, &amp;hellip;) produced by the algorithm I proposed converge to 2?&lt;/p>
&lt;ol>
&lt;li>As we mentioned earlier, the absolute values of terms in both positive and negative groups both sum to $\infty$, so we will always
have enough terms that we haven&amp;rsquo;t used that can get us above or below 2 when we need to.&lt;/li>
&lt;li>The sizes of the successive terms from each group that we use are getting smaller and smaller, so at each step, the amount by which
we exceed and fall below 2 shrinks.&lt;/li>
&lt;/ol>
&lt;p>Taken together, we see that the partial sums will oscillate around 2, and successive oscillations will be smaller and smaller; that&amp;rsquo;s simply a fancy
way of saying that our series now converges to 2. But didn&amp;rsquo;t it just converge to 0.694?&lt;/p>
&lt;p>This is a quirk of trying to sum infinitely many terms. When finitely many terms are involved, rearranging terms doesn&amp;rsquo;t affect
anything. There are even many infinite sums for which the same is true (our second one from earlier, for example).
With infinity, though, it is always important to reexamine facts that are obvious or trivial-seeming in finite territory. We&amp;rsquo;ve just
shown that under special conditions, the infinite version of $a + b$ is not necessarily the same as $b + a$.&lt;/p></description></item><item><title>The St. Petersburg paradox</title><link>https://www.jgindi.me/posts/2022-05-18-stpete/</link><pubDate>Wed, 18 May 2022 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2022-05-18-stpete/</guid><description>&lt;p>In this post, I want to talk through a simple mathematical result that forces us
to think twice about relying too heavily on averages.&lt;/p>
&lt;h2 id="quick-review-of-expected-values">Quick review of expected values&lt;/h2>
&lt;p>Given an opportunity to play a game with uncertain outcomes, one reasonable way
to value the opportunity is to weight each possible reward by its probability of occurring
and sum up the results. This quantity is called the expected value, or expectation,
of the game. (Note that a simple average is an expected value with equal weight on
each outcome.)
Another way of saying this is that if you have
to pay a fee to play this game, the fee you should be willing to pay is the game&amp;rsquo;s
expected value (or less, of course).
Mathematically, if the payoff of the uncertain game is represented by the random
variable $X$, the possible outcomes are $x_1,x_2,\dots,x_n$,
and the corresponding probabilities of the outcomes are correspondingly $p_1,p_2,\dots, p_n$,
then the value of playing the game would be given by
&lt;/p>
$$E[X] = x_1p_1 + \dots + x_np_n = \sum_{i=1}^n x_i p_i.$$&lt;p>
(The notation $E[X]$ is how we represent the expected value of $X$.)
For many bets, this approach has intuitive appeal, and there are many scenarios
in which it is used directly.&lt;/p>
&lt;h2 id="flipping-coins">Flipping coins&lt;/h2>
&lt;p>With this in mind, consider the following game. You flip a coin until you see a head. If the first head appears
on the $k$th flip, you win $2^k$ dollars. The question is: how much would you be
willing to pay to play this game?&lt;/p>
&lt;p>Well, you might begin, there are infinitely many possible outcomes (head after 1, head after 2,&amp;hellip;).
Each outcome requires flipping $k-1$ tails and then a single head. Using a fair
coin, we have
&lt;/p>
$$
P(\text{game ends on flip}~k) = \frac{1}{2^{k-1}}\frac{1}{2} = \frac{1}{2^k}.
$$&lt;p>
Using the framework of outcome values and probabilities of those outcomes,
we can express $x_i$ and $p_i$ for each $i=1,2,\dots$ as
&lt;/p>
$$
\begin{align*}
x_i &amp;= 2^i\\
p_i &amp;= \frac{1}{2^i}.
\end{align*}
$$&lt;p>
But wait! If this is the case, then for each $i$, $x_ip_i = 2^i/2^i = 1$, so the
expected value of the game is actually infinite (the sum of infinitely many 1s)!
It seems, according to this mathematically sound analysis, that we should be
willing to participate in this game at &lt;em>any&lt;/em> offered price. With this information,
how much would you pay to play this game?&lt;/p>
&lt;h2 id="what-gives">What gives?&lt;/h2>
&lt;p>If you think something is fishy here, you&amp;rsquo;re right. This problem is so well-known
it has a name: the St. Petersburg paradox. It isn&amp;rsquo;t actually a paradox, but the use of the word refers
to the fact that on the one hand, this game has infinite expected value, but on the other,
the probability of making a large sum of money is vanishingly small. To put a finer
point on it, the probability of winning more than $2^k$ dollars is $1/2^k$! For
even moderate values of $k$, this probability is minuscule.&lt;/p>
&lt;p>This sort of issue has caused many to reject the exclusive use
of expected value as a valuation technique. Some suggest adding a measure
of risk (as is customary in any sort of financial application),
some advocate for use of the median instead, and still others advocate
using the expected value of some utility function applied to the outcomes, rather
than the outcomes themselves. These are all interesting areas to explore, but the
bottom line is that expectations alone can lead to some head-scratching issues.&lt;/p>
&lt;h2 id="finite-resources">Finite resources&lt;/h2>
&lt;p>One other aspect of the problem to which some attribute the counterintuitive result
is the implicit assumption we made that the banker or casino has infinite wealth
with which to bankroll the game. Let&amp;rsquo;s see what happens if the banker only has finite
wealth. That is, let&amp;rsquo;s now suppose that the banker has
$W$ (for wealth) dollars with which to fund the game. This introduces a ceiling on the number of
rounds before the game ends: $L = \lfloor \log_2 W \rfloor$ (i.e., the number of times you
can double the payout before exceeding $W$, rounded down). Having reframed the problem this way,
the expected value calculation returns a saner result:
&lt;/p>
$$
E[X] = \sum_{i=1}^L 2^i \frac{1}{2^i} = \sum_{i=1}^L 1 = L.
$$&lt;p>
That is, the expected value of the game is now logarithmic in the banker&amp;rsquo;s wealth.
This means, for example, that if the banker has one billion dollars, the game is only
worth about \$30. This accords with our intuition about small probabilities of winning
anything significant.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Expected value is one of the most important concepts in all of probability, if not &lt;em>the&lt;/em> most important!
Even so, as we&amp;rsquo;ve shown in this post, it is not a panacea. If you&amp;rsquo;re not careful, strange
(and fascinating) things might happen.&lt;/p></description></item><item><title>Linear interpolation in one and two dimensions</title><link>https://www.jgindi.me/posts/2021-09-01-interp/</link><pubDate>Thu, 02 Sep 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-09-01-interp/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post, I want to demonstrate how helpful visual intuition can
be. To do this, we are going to
think about how to extend a technique called linear interpolation from one
dimension to two. Loosely speaking, techniques for interpolation allow us to
use information that we know to hopefully make reasonable estimates of
quantities we don&amp;rsquo;t know. In the rest of this post, we will
first discuss linear interpolation in one dimension, and then use some pictures to
figure out what it would mean to linearly interpolate in two dimensions.&lt;/p>
&lt;h2 id="linear-interpolation-1d">Linear Interpolation: 1D&lt;/h2>
&lt;p>Suppose that you have two points $(x_1, f(x_1))$ and $(x_2, f(x_2))$ and a value
$x_1 \leq x \leq x_2$ whose corresponding value $f(x)$ we want to estimate. The
first thing you might think to do is to assume that $f$ is linear.
You would then find the slope $m$ and intercept $b$ of
the line connecting the points $(x_1, f(x_1))$ and $(x_2, f(x_2))$, and then use
that line to estimate that $f(x) = mx + b$. This is shown visually in the figure
below.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-1d.png" alt="drawing" width="400" height="300"/>
&lt;p>It turns out that by rearranging the expression $f(x) = mx + b$ (with $m$ and $b$
expanded as shown in the figure), we can actually express $f(x)$ in a different way:
&lt;/p>
$$
f(x) = \theta f(x_2) + (1 - \theta)f(x_1),
$$&lt;p>
where $\theta = \frac{x - x_1}{x_2 - x_1}$ is the fraction of the total distance
between $x_1$ and $x_2$ that is between $x$ and $x_1$. This formulation furnishes
another way to think about what linear interpolation does: it estimates $f(x)$ by
mixing some amount of $f(x_1)$ with some amount of $f(x_2)$. The amounts of each
that are used depend on how close $x$ lies to $x_1$ (or $x_2$). (To be precise,
how much of $f(x_1)$ we use actually depends on the size of the distance between $x$ and
$x_2$. As $x$ moves further from $x_2$, the coefficient on $f(x_1)$ should get bigger.)&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-1d-mix.png" alt="drawing" width="400" height="200"/>
&lt;h2 id="linear-interpolation-2d">Linear interpolation: 2D&lt;/h2>
&lt;p>Now suppose that instead of $(x, f(x))$ pairs, we have $((x, y), f(x, y))$ pairs.
Whereas in the prior section, the domain of $f$ is the set of real numbers, in this
section, the domain is actually points in the plane. The setting for interpolation
in two dimensions is that we have four points in the plane $(x_1, y_1)$, $(x_1, y_2)$,
$(x_2, y_1)$, and $(x_2, y_2)$ whose $f$ values we know. We are then given another
point, $(x, y)$, and we are trying to estimate the value of $f(x, y)$ (again assuming
that $f$ is linear). This setup is shown graphically below.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-surface.png" alt="drawing" width="300" height="300"/>
&lt;p>In this scenario, $f$ actually defines a surface (shown in gray), rather than a curve.
In order to estimate the value of $f(x,y)$, we want to come up with a formula
for linear interpolation in two dimensions. There are various ways to derive
the formula for this**, but here I want to discuss one that I think has an elegant and
very intuitive visual interpretation. It turns out that we can borrow the mixture idea
from the 1D case, but instead of a mixture based on the distances along a line,
we are going to use areas of subrectangles.&lt;/p>
&lt;img src="https://www.jgindi.me/posts/interp/interp-flat.png" alt="drawing" width="400" height="300"/>
&lt;p>The key here is that we are using areas as a proxy for 2D &amp;ldquo;distance&amp;rdquo;. To sanity check this intuition,
note that if $(x, y)$ is one of our four known points, say $(x_1, y_1)$, the area of
the subrectangle corresponding to it will be equal to the total area of the larger rectangle.
Using this method, we can easily see that in this case,
&lt;/p>
$$
f(x, y) = \frac{(x_2 - x_1)(y_2 - y_1)}{(x_2 - x_1)(y_2 - y_1)} \cdot f(x_1, y_1) + 0\cdot f(x_2, y_1) + 0 \cdot f(x_1, y_2) + 0 \cdot f(x_2, y_2) = f(x_1, y_1),
$$&lt;p>
as we expect.&lt;/p>
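&lt;p>The area-mixing idea can be sketched as follows; each known corner value is weighted by the area of the subrectangle diagonally opposite it:&lt;/p>

```python
def bilerp(x1, y1, x2, y2, f11, f21, f12, f22, x, y):
    """Bilinear interpolation of f(x, y) from corner values f_ij = f(x_i, y_j).

    Each corner is weighted by the area of the subrectangle diagonally
    opposite it, divided by the total area of the rectangle.
    """
    area = (x2 - x1) * (y2 - y1)
    w11 = (x2 - x) * (y2 - y) / area
    w21 = (x - x1) * (y2 - y) / area
    w12 = (x2 - x) * (y - y1) / area
    w22 = (x - x1) * (y - y1) / area
    return w11 * f11 + w21 * f21 + w12 * f12 + w22 * f22

# At a corner we recover the known value; at the center we get the average.
assert bilerp(0, 0, 2, 2, 1, 3, 5, 7, 0, 0) == 1
assert bilerp(0, 0, 2, 2, 1, 3, 5, 7, 1, 1) == 4
```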
&lt;p>This intuition can be extended to an arbitrary number of dimensions. In 3D, for
instance, we would use volumes of sub-rectangular boxes rather than areas of subrectangles.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This isn&amp;rsquo;t an especially deep idea from a mathematical standpoint, but I thought that
it was a nice illustration of how sometimes, visual intuition can take us a very long way.
If you&amp;rsquo;re ever trying to solve some challenging problem and you don&amp;rsquo;t know where to start,
drawing some pictures might be a great way to get the juices flowing.&lt;/p>
&lt;br>
&lt;p>**The usual, and sort of messy, way to derive the formula for bilinear interpolation
is to first interpolate in one of the variables and then the other. Just to give a sense
for the way that this gets pretty cumbersome, we will briefly show how to do it.
First, we compute $f(x_1, y)$ and $f(x_2, y)$:
&lt;/p>
$$
\begin{align}
f(x_1,y) &amp;= \frac{y-y_1}{y_2 - y_1}f(x_1, y_2) + \frac{y_2 - y}{y_2 - y_1}f(x_1, y_1)\\
f(x_2,y) &amp;= \frac{y-y_1}{y_2 - y_1}f(x_2, y_2) + \frac{y_2 - y}{y_2 - y_1}f(x_2, y_1)
\end{align}
$$&lt;p>
Then we compute $f(x,y) = \frac{x-x_1}{x_2 - x_1} f(x_2, y) + \frac{x_2-x}{x_2 - x_1}f(x_1, y)$,
plugging in (1) and (2). In my opinion, this quickly becomes unwieldy in more dimensions,
and the intuition becomes less clear the more dimensions you try to think about.&lt;/p></description></item><item><title>Finding eigenvalues</title><link>https://www.jgindi.me/posts/2021-03-08-finding-eigvals/</link><pubDate>Mon, 08 Mar 2021 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2021-03-08-finding-eigvals/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Over the past few months, I&amp;rsquo;ve been working on some optimization-related projects
at work. Making optimization algorithms efficient and effective often comes down to command of
numerical linear algebra, otherwise known as the intersection of linear algebra and computers.
It is one thing to discover an algorithm for certain problems that works well in
the ether. It is another entirely to ensure that the algorithm works well once
it violently collides with the physics of finite precision computers. As someone
who has come to deeply appreciate the power of mixing elegance and implementation,
I decided to delve more deeply into the subject by making my way through
&lt;em>Numerical Linear Algebra&lt;/em> by Trefethen and Bau.&lt;/p>
&lt;p>This post works through one of the chapters about developing an algorithm to find
the largest eigenvalue and its corresponding eigenvector of a symmetric positive
definite matrix $A$.&lt;/p>
&lt;h2 id="review-of-eigenvalues-and-eigenvectors">Review of eigenvalues and eigenvectors&lt;/h2>
&lt;p>Eigenvalues and eigenvectors are central in applied linear algebra. They have
applications across machine learning, communication systems, mechanical engineering,
optimization, and many other disciplines. One particularly important application
of eigenvalues to our everyday lives is search engines! In fact, Google&amp;rsquo;s PageRank
algorithm (or at least the initial algorithm), is all based on eigenthings. For a
great explanation of the original conception of using PageRank to organize the internet,
check out &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=1552E98D8F3A3CDBC9C7BD9E39591000?doi=10.1.1.38.5427&amp;amp;rep=rep1&amp;amp;type=pdf">the original paper&lt;/a>
by Sergey Brin and Larry Page. As my good friend Ben reminded me, &amp;ldquo;Eigenvectors
power our internet!&amp;rdquo;&lt;/p>
&lt;p>In essence, an eigenvector $v$ of a matrix $A$ is a vector that is exclusively
stretched (not rotated) when acted upon by $A$. An eigenvalue $\lambda$ of $A$
that corresponds to $v$ is the stretch factor. Formally, $v$ is an eigenvector
of $A$, with corresponding eigenvalue $\lambda$ if we have
&lt;/p>
$$Av = \lambda v.$$&lt;p>
For the rest of the post, we will assume we&amp;rsquo;re dealing with a symmetric
positive definite matrix.&lt;/p>
&lt;h2 id="a-helpful-characterization">A helpful characterization&lt;/h2>
&lt;p>Our first step will be to develop a helpful characterization of eigenvalues. To do
this, given a matrix $A$ and a nonzero vector $x$, we consider the problem
&lt;/p>
$$\text{minimize}_\alpha ~~ \| Ax - \alpha x\|_2^2.$$&lt;p>
We are essentially looking for the scalar that best approximates an eigenvalue
corresponding to the vector $x$, &lt;em>i.e.&lt;/em> an $\alpha$ such
that $Ax \approx \alpha x$. We can easily solve this minimization problem by
setting the derivative of the objective (w.r.t. $\alpha$) to 0 and solving for $\alpha$.
Carrying this out, as a function of $x$, we get:
&lt;/p>
$$
\alpha(x) = \frac{x^TAx}{x^Tx}
$$&lt;p>What we are interested in are the critical points of $\alpha(x)$ as a function of $x$.
Using the vector analog of the quotient rule for taking derivatives, we have
&lt;/p>
$$
\begin{align*}
\nabla_x \alpha(x) &amp;= \frac{2Ax}{x^Tx} - \frac{(2x)(x^TAx)}{(x^Tx)^2} \\
&amp;= \frac{2}{x^Tx}\biggl(Ax - \biggl(\frac{x^TAx}{x^Tx}x\biggr)\biggr) \\
&amp;= \frac{2}{x^Tx}(Ax - \alpha(x)x).
\end{align*}
$$&lt;p>Suppose $v$ is a critical point of $\alpha$, &lt;em>i.e.&lt;/em> $\nabla \alpha(v) = 0$. For that $v$,
we would have $Av = \alpha(v)v$. That is, $v$ is an eigenvector of $A$, with eigenvalue
$\alpha(v)$. Conversely, if $v$ is an eigenvector of $A$ with corresponding
eigenvalue $\lambda$, then we have $\alpha(v) = \lambda \frac{v^Tv}{v^Tv} = \lambda$.
We&amp;rsquo;ve now shown that the vectors $v$ that make the derivative 0 are &lt;em>exactly&lt;/em> the
eigenvectors of $A$. For each of those eigenvectors, $\alpha$ produces the corresponding
eigenvalue.&lt;/p>
&lt;p>This characterization of eigenvalues and eigenvectors is important because it gives
us an iterative way to think about these mathematical objects with a definition that is more amenable
to computation. The function $\alpha$ is important enough that it has a name, the
&lt;em>Rayleigh quotient&lt;/em>, and it is crucial to our development of the algorithms below.&lt;/p>
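&lt;p>As a quick sanity check, we can verify numerically that the Rayleigh quotient of an exact eigenvector recovers the corresponding eigenvalue. Here is a small Julia sketch (the symmetric positive definite test matrix is arbitrary, chosen only for illustration):&lt;/p>

```julia
using LinearAlgebra

# Rayleigh quotient α(x) = xᵀAx / xᵀx
rayleigh(A, x) = (x' * A * x) / (x' * x)

# an arbitrary symmetric positive definite test matrix
B = randn(4, 4)
A = B' * B + 4I

# for an exact eigenvector, the Rayleigh quotient is exactly the eigenvalue
vals, vecs = eigen(A)
v = vecs[:, 1]
@assert isapprox(rayleigh(A, v), vals[1])
```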
&lt;p>Thus far, given an arbitrary vector, we&amp;rsquo;ve found a way to come up with an
eigenvalue-like scalar that corresponds to it. Intuitively, in order to find a bona
fide eigenvalue of $A$, we have to iteratively nudge our initial eigenvector estimate
toward eigenvector-hood. As our estimate tends toward an eigenvector $v$, $\alpha(v)$
tends toward an eigenvalue of $A$.&lt;/p>
&lt;h2 id="power-iteration">Power iteration&lt;/h2>
&lt;p>Power iteration is not our destination but it is a conceptual building block that
we will spend a moment on here. Ultimately, it has some limitations, but its ideas
will help us later.&lt;/p>
&lt;h3 id="the-algorithm">The algorithm&lt;/h3>
&lt;p>The algorithm finds the largest eigenvalue and corresponding eigenvector of a matrix
$A$. To do this, it starts with an arbitrary vector $v_0$ and computes $v_i = Av_{i-1}$,
for $i = 1,\dots, m$ (normalizing each of the results). It then uses the estimate
$v_i$ to compute our $i$th eigenvalue estimate by computing $\alpha(v_i) = v_i^TAv_i$
(no denominator because $|| v_i || = 1$). As we will show momentarily, as $i \to \infty$,
$\lambda$ converges to the largest eigenvalue $\lambda_1$ of $A$ and $v$ converges to an
eigenvector $v_1$ corresponding to $\lambda_1$. Before we prove anything, here is code
that implements power iteration in the Julia programming language.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-julia" data-lang="julia">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">function&lt;/span> power_iteration(A, iters)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> m &lt;span style="color:#f92672">=&lt;/span> size(A, &lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> rand(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">/&lt;/span> norm(v, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#f92672">:&lt;/span>iters
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># update v&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> u &lt;span style="color:#f92672">=&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> u &lt;span style="color:#f92672">/&lt;/span> norm(u, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Rayleigh quotient&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> v&lt;span style="color:#f92672">&amp;#39;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> λ, v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In essence, what we&amp;rsquo;ve provided is a way of finding the largest eigenvalue
and its eigenvector beginning from a crude estimate of the eigenvector.&lt;/p>
&lt;h3 id="why-does-it-work">Why does it work?&lt;/h3>
&lt;p>To show that the sequences of iterates converge in the way that we claimed, we
just need to show that the sequence $v_i$ converges to an eigenvector of $A$ (
because we&amp;rsquo;ve already shown that given an eigenvector, $\alpha$ produces an eigenvalue).&lt;/p>
&lt;p>Let&amp;rsquo;s say that $\{q_i\}$, $i = 1,\dots,m$, make up an orthonormal basis of eigenvectors of $A$
corresponding to the eigenvalues $\lambda_i$ (such a basis exists because $A$ is symmetric).
We will also assume that $|\lambda_1| > |\lambda_2| \geq \dots \geq |\lambda_m|$; this gap between the top two eigenvalues is what drives convergence.
Because $v_k = c_kA^kv_0$ for some sequence of constants $c_k$ (because of the
normalization at each step), we can use the expansion of $v_k$ in the basis $\{q_i\}$ as
&lt;/p>
$$
\begin{align*}
v_k &amp;= c_kA^kv_0 \\\\
&amp;= c_kA^k(a_1q_1 + \dots + a_mq_m)\\\\
&amp;= c_k(a_1\lambda_1^kq_1 + \dots + a_m\lambda_m^kq_m)\\\\
&amp;= c_k\lambda_1^k(a_1q_1 + a_2(\lambda_2/\lambda_1)^kq_2 + \dots + a_m(\lambda_m/\lambda_1)^kq_m)
\end{align*}
$$&lt;p>
Because $\lambda_1$ is larger in magnitude than all the other eigenvalues, as $k \to \infty$, all
but the first of the terms in the parentheses in the last line go to zero. So, provided $a_1 \neq 0$
(which holds with probability 1 for a random $v_0$), $v_k$ approaches a scalar multiple of $q_1$,
the eigenvector of $A$ corresponding to the largest eigenvalue. (We do not need to worry about
the sign of the constants; the important thing is that the one-dimensional
subspace spanned by $v_k$ converges to the one spanned by $q_1$.)&lt;/p>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;p>Unfortunately, power iteration is not really used in practice for a couple of reasons.
The first is that it only finds the eigenpair corresponding to the largest eigenvalue. The second
is that the rate at which it converges depends on how much larger $\lambda_1$ is than $\lambda_j$
for $j > 1$. If, for instance, $\lambda_1$ and $\lambda_2$ are close in magnitude,
the convergence is very slow. There is a modification we can make to mitigate some of
these issues, which we&amp;rsquo;ll discuss next.&lt;/p>
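&lt;p>We can see this slowdown concretely. The sketch below (a stripped-down version of the function above; the eigenvalues are made up for illustration) runs power iteration on two diagonal matrices, one with a well-separated top eigenvalue and one where the top two nearly coincide:&lt;/p>

```julia
using LinearAlgebra

# bare-bones power iteration returning the final Rayleigh quotient
function power_eig(A, iters)
    v = normalize(rand(size(A, 1)))
    for _ in 1:iters
        v = normalize(A * v)
    end
    return v' * A * v
end

# well-separated spectrum: the ratio 0.2/2.0 shrinks the error very fast
fast = power_eig(Diagonal([2.0, 0.2, 0.1]), 50)
# nearly tied top eigenvalues: the ratio 1.99/2.0 barely shrinks it at all
slow = power_eig(Diagonal([2.0, 1.99, 0.1]), 50)
```

After 50 iterations the first estimate agrees with 2.0 to many digits, while the second typically still carries noticeable error.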
&lt;h2 id="inverse-iteration">Inverse iteration&lt;/h2>
&lt;p>Let&amp;rsquo;s see if we can find a better, more reliable way to find these eigenvectors. Suppose
that $\mu$ is a scalar that is &lt;em>not&lt;/em> an eigenvalue of $A$ and let $v$ be an eigenvector
of $A$ with associated eigenvalue $\lambda$. We can show that $v$ is also an eigenvector
of $(A - \mu I)^{-1}$ by
&lt;/p>
$$
\begin{align*}
(A - \mu I)v &amp;= Av - \mu v \\\\
&amp;= (\lambda - \mu) v
\end{align*}
$$&lt;p>
If we multiply on the left by $(A - \mu I)^{-1}$ and then divide on both sides by
$\lambda - \mu$, we have $(A - \mu I)^{-1}v = v / (\lambda - \mu)$. In other words,
$1/(\lambda - \mu)$ is an eigenvalue of $(A - \mu I)^{-1}$. (The invertibility of
$A - \mu I$ follows from the fact that $\lambda_i - \mu \neq 0$ for
each $i$.)&lt;/p>
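&lt;p>We can confirm this spectral mapping numerically. For a small symmetric test matrix (chosen arbitrarily for illustration) and a shift $\mu$, the eigenvalues of $(A - \mu I)^{-1}$ come out to exactly $1/(\lambda_i - \mu)$:&lt;/p>

```julia
using LinearAlgebra

# a small symmetric test matrix with eigenvalues (5 ± √5)/2 ≈ 1.38 and 3.62
A = [2.0 1.0; 1.0 3.0]
μ = 1.9                       # a shift that is not an eigenvalue of A

vals = eigvals(A)
shifted = eigvals(inv(A - μ * I))

# each eigenvalue λ of A maps to 1/(λ - μ); the λ closest to μ becomes
# the dominant (largest-magnitude) eigenvalue of the shifted inverse
```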
&lt;p>(You might be thinking: &amp;ldquo;What if $\mu$ is exactly equal to or very close to an
eigenvalue of $A$?&amp;rdquo; While we won&amp;rsquo;t go into detail here, it turns out that these
cases don&amp;rsquo;t really cause additional computational issues.)&lt;/p>
&lt;p>What&amp;rsquo;s nice about all this, though, is that if we take $\mu$ to be
a reasonable estimate of one of the eigenvalues $\lambda_i$ (more on this in a bit),
then we will have $(\lambda_i - \mu)^{-1}$ much larger in magnitude than $(\lambda_j - \mu)^{-1}$
for $j \neq i$. We can thus conduct power iteration on $(A - \mu I)^{-1}$ and converge
very quickly to an eigenvector of $A$ &amp;ndash; essentially because we&amp;rsquo;ve magnified the
difference between one eigenvalue of $A - \mu I$ and the rest. Before we move on,
here is code (in Julia) that carries out inverse iteration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-julia" data-lang="julia">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">function&lt;/span> inverse_iteration(A, μ, iters)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> m &lt;span style="color:#f92672">=&lt;/span> size(A, &lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> rand(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">/&lt;/span> norm(v, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>    &lt;span style="color:#75715e"># compute the shifted matrix once for reuse; we could also factor it once and reuse the factorization&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> B &lt;span style="color:#f92672">=&lt;/span> A &lt;span style="color:#f92672">-&lt;/span> μ &lt;span style="color:#f92672">*&lt;/span> I(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#f92672">:&lt;/span>iters
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># applying the inverse matrix is same as solving system&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> w &lt;span style="color:#f92672">=&lt;/span> B &lt;span style="color:#f92672">\&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># normalize&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> w &lt;span style="color:#f92672">/&lt;/span> norm(w, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>At this point, we have a way of turning vectors into reasonable eigenvalue estimates
(the Rayleigh quotient) and a reasonable way of turning eigenvalue estimates into
eigenvectors (inverse iteration). Can we combine these somehow? The answer is yes,
and this discussion is the final leg of our journey.&lt;/p>
&lt;h2 id="rayleigh-quotient-iteration">Rayleigh quotient iteration&lt;/h2>
&lt;p>We can put the two algorithms together by repeating two operations:&lt;/p>
&lt;ol>
&lt;li>Use an inverse iteration step to refine our estimate of the eigenvector using
the latest estimate of $\lambda$.&lt;/li>
&lt;li>Use the Rayleigh quotient to turn the refined eigenvector estimate into a refined
eigenvalue estimate.&lt;/li>
&lt;/ol>
&lt;p>As the eigenvalue estimate $\mu$ becomes better, the speed of convergence of inverse
iteration increases, so that this natural combination yields our best algorithm yet.
Without detailing the convergence proof, this algorithm converges &lt;em>extremely&lt;/em> quickly:
on every iteration, the number of digits of accuracy on the eigenvalue estimate &lt;em>triples&lt;/em>!&lt;/p>
&lt;p>Here is the code in Julia:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-julia" data-lang="julia">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">function&lt;/span> rayleigh_iteration(A, iters)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> m &lt;span style="color:#f92672">=&lt;/span> size(A, &lt;span style="color:#ae81ff">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> rand(m)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">/&lt;/span> norm(v, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> v&lt;span style="color:#f92672">&amp;#39;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> i &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#f92672">:&lt;/span>iters
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># inverse iteration step&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> w &lt;span style="color:#f92672">=&lt;/span> (A &lt;span style="color:#f92672">-&lt;/span> λ &lt;span style="color:#f92672">*&lt;/span> I(m)) &lt;span style="color:#f92672">\&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> v &lt;span style="color:#f92672">.=&lt;/span> w &lt;span style="color:#f92672">./&lt;/span> norm(w, &lt;span style="color:#ae81ff">2&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Rayleigh quotient&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> λ &lt;span style="color:#f92672">=&lt;/span> v&lt;span style="color:#f92672">&amp;#39;&lt;/span> &lt;span style="color:#f92672">*&lt;/span> A &lt;span style="color:#f92672">*&lt;/span> v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> λ, v
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">end&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this post, we went through a couple of different algorithms that help us find eigenvalues
and eigenvectors. While we typically first learn that eigenvalues and eigenvectors should be thought
about in the context of characteristic polynomials and determinants, it turns out that for both
theoretical (by Abel, polynomials of degree five and higher have no closed-form roots) and computational (polynomial root-finding is badly ill-conditioned)
reasons, an iterative approach is actually &lt;em>required&lt;/em> for finding them in practice.&lt;/p>
&lt;p>In addition to wanting to cement my understanding of these algorithms as well as possible
before moving to the next lecture in the textbook, I thought this was a cool case of
different approaches combining their strengths to yield an algorithm more effective than
the individual parts.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>QR factorization</title><link>https://www.jgindi.me/posts/2020-12-27-qr/</link><pubDate>Sun, 27 Dec 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-12-27-qr/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This post was inspired by some conversations with my brother about concepts from linear algebra. I&amp;rsquo;m writing it mostly to better understand the idea myself, but hopefully some others will find it clear and useful too.&lt;/p>
&lt;p>Generally speaking, a matrix decomposition starts with a matrix $A$ and asks how we can decompose it into some other matrices that have convenient properties. These algorithms are often useful for making common matrix-related tasks, such as solving (potentially large) systems of linear equations, much more computationally efficient. You can find a long list of decompositions &lt;a href="https://en.wikipedia.org/wiki/Matrix_decomposition">here&lt;/a>, but in this post we&amp;rsquo;re going to talk about one particular decomposition (or factorization): the QR factorization.&lt;/p>
&lt;p>We&amp;rsquo;ll begin by showing why the decomposition comes in handy for solving linear equations. Once we&amp;rsquo;ve convinced ourselves that the decomposition is useful, we will then discuss how we go about finding the components that the decomposition finds for us.&lt;/p>
&lt;h2 id="what-the--factorization-does">What the $QR$-factorization does&lt;/h2>
&lt;p>Simply put, the $QR$ factorization algorithm takes as input a matrix $A$ and outputs a pair of matrices, $Q$ and $R$, such that $A = QR$, $Q$ is orthogonal, and $R$ is upper triangular (which means that the elements of the matrix below the diagonal are all 0).&lt;/p>
&lt;p>Let&amp;rsquo;s suppose that we&amp;rsquo;re interested in solving a system of linear equations given compactly in matrix-vector notation by $Ax = b$, where $A \in \mathbf{R}^{n\times k}$, $x \in \mathbf{R}^k$, and $b \in \mathbf{R}^n$. We will suppose that $A$ is tall or square and that it has full column rank (its columns are an independent set). We will also assume that $b$ is in the range of $A$, so that a solution exists. The variable here is $x$; that is, we want to find $x_1, \dots, x_k$ that simultaneously satisfy all of the equations in the system.&lt;/p>
&lt;p>Now let&amp;rsquo;s magically decompose $A$ into the product of matrices $Q$ and $R$, where $Q$ is orthogonal and $R$ is upper triangular. Then we can rewrite the system we want to solve as $QRx = b$. We can solve the system in two steps:&lt;/p>
&lt;ol>
&lt;li>Solve the system $Qz = b$ by left multiplying both sides by $Q^T$ (benefit of $Q$ being orthogonal).&lt;/li>
&lt;li>Solve the system $Rx = z$. If you think about what it means for $R$ to be upper triangular, we can solve this part by first obtaining $x_k$ by
\begin{equation}
x_k = \frac{z_k}{R_{kk}},
\end{equation}
then obtaining $x_{k-1}$ by
\begin{equation}
x_{k-1} = \frac{z_{k-1} - R_{k-1,k}x_k}{R_{k-1,k-1}},
\end{equation}
then obtaining $x_{k-2}$, and so on, until we&amp;rsquo;ve computed all of the $x_i$. This procedure is called back substitution.&lt;/li>
&lt;/ol>
&lt;p>This turns out to be much more efficient than solving the system in the naive way (e.g., forming an inverse of $A$ and left multiplying the original system by it).&lt;/p>
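&lt;p>To make the two steps concrete, here is a small sketch using Julia's built-in qr (the matrix and right hand side are made up for illustration):&lt;/p>

```julia
using LinearAlgebra

# a tall matrix with independent columns and a right hand side in its range
A = [1.0 2.0; 3.0 4.0; 5.0 7.0]
x_true = [1.0, -2.0]
b = A * x_true

F = qr(A)
z = F.Q' * b          # step 1: orthogonality makes inverting Q a transpose
x = F.R \ z[1:2]      # step 2: back substitution on the triangular system
```

Up to floating point error, x recovers x_true.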
&lt;p>The other great advantage of computing a factorization like the $QR$ factorization is that it can be cached and reused! Suppose that instead of a single system, you want to solve 1000 systems, each with a different right hand side (e.g., $Ax = b_1$, $Ax = b_2$, &amp;hellip;, $Ax = b_{1000}$). After computing the factorization once, you can reuse it to make all 1000 solves more efficient! In some sense, you can amortize the cost of the factorization over a bunch of reuses. In real applications that I&amp;rsquo;ve been a part of developing, using matrix decompositions in this way has resulted in noticeable and impactful speedups.&lt;/p>
&lt;h2 id="finding--and">Finding $Q$ and $R$&lt;/h2>
&lt;p>So how do we find $Q$ and $R$?&lt;/p>
&lt;p>There are a few ways to compute $Q$ and $R$, but in this post I want to walk through the most intuitive of them: the Gram-Schmidt algorithm (GS).&lt;/p>
&lt;p>We will describe GS when the input is a linearly independent set of $n$-vectors $a_1,\dots,a_k$ (this implies $k \leq n$). The general idea of the algorithm is that at the $i$th step we construct a vector $q_i$ using $a_i$ as a starting point and removing from it everything that $a_i$ shares with the vectors we&amp;rsquo;ve already computed in prior steps, i.e., $q_1$ through $q_{i-1}$. By construction, $q_i$ doesn&amp;rsquo;t have anything in common with any of the vectors computed before it, so the collection $\{q_1,\dots,q_k\}$ is orthogonal. If we divide each vector by its length at each step, the orthogonal collection becomes orthonormal. The vectors $q_i$ become the columns of $Q$.&lt;/p>
&lt;p>In order to compute $R$, we need to make the idea of &amp;ldquo;removing everything that $a_i$ shares with the vectors we&amp;rsquo;ve already computed in prior steps&amp;rdquo; more precise. Suppose we are part of the way through the algorithm. As we&amp;rsquo;re preparing for the $i$th step, we have the vector $a_i$ that we need to incorporate into our output and the orthonormal collection $q_1,\dots,q_{i-1}$ that we&amp;rsquo;ve built up so far. For some $1 \leq j \leq i-1$, let $v_j = (q_j^Ta_i)q_j$. If we subtract $v_j$ from $a_i$, let&amp;rsquo;s see what the result &amp;ldquo;has in common&amp;rdquo; with $q_j$ by taking the inner product:&lt;/p>
$$
\begin{align*}
(a_i - v_j)^Tq_j &amp;= (a_i - (q_j^Ta_i)q_j)^T q_j \\\\
&amp;= a_i^Tq_j - (q_j^Ta_i)(q_j^Tq_j) \\\\
&amp;= a_i^Tq_j - q_j^Ta_i \\\\
&amp;= 0
\end{align*}
$$&lt;p>This means $a_i - v_j$ has nothing in common with, or in math parlance, is orthogonal to, $q_j$ (we used the fact that $q_j^Tq_j = 1$)! Thus, to make $q_i$ orthogonal to all of the previous $q_j$, we just set $q_i = a_i - v_1 - v_2 - \dots - v_{i-1}$. The GS algorithm can thus be stated compactly:&lt;/p>
&lt;p>For $i = 1,\dots,k$, let $p_i = a_i - v_1 - \dots - v_{i-1}$. Then define $q_i = p_i / ||p_i||$. When you&amp;rsquo;ve cycled through all values of $i$, return the collection $q_1,\dots,q_k$.&lt;/p>
&lt;p>With this in hand, we can now define the entries of $R$. If $p_i = a_i - v_1 - \dots - v_{i-1}$, then we can isolate $a_i$ and obtain&lt;/p>
$$
\begin{align*}
a_i &amp;= \|p_i\|q_i + v_1 + \dots + v_{i-1} \\\\
&amp;= \|p_i\|q_i + (q_1^Ta_i)q_1 + \dots + (q_{i-1}^Ta_i)q_{i-1}.
\end{align*}
$$&lt;p>We will choose $R_{ij} = q_i^Ta_j$ for $i&lt;j$, $R_{ii} = \|p_i\|$ for $i=j$, and $R_{ij} = 0$ for $i>j$; in essence, we&amp;rsquo;re just picking our entries of $R$ right out of the expression for $a_i$ in terms of the $q_i$.&lt;/p>
&lt;p>By defining $Q$ and $R$ as such, we have $A = QR$, as we wanted.&lt;/p>
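&lt;p>The compact statement above translates almost line for line into Julia. Here is a sketch of classical Gram-Schmidt (not the numerically preferred variant, but faithful to the construction described):&lt;/p>

```julia
using LinearAlgebra

# classical Gram-Schmidt QR, following the construction above
function gram_schmidt_qr(A)
    n, k = size(A)
    Q = zeros(n, k)
    R = zeros(k, k)
    for i in 1:k
        p = A[:, i]
        for j in 1:i-1
            R[j, i] = Q[:, j]' * A[:, i]  # component of aᵢ along qⱼ
            p -= R[j, i] * Q[:, j]        # remove it
        end
        R[i, i] = norm(p)
        Q[:, i] = p / R[i, i]             # normalize
    end
    return Q, R
end

A = [1.0 1.0; 1.0 0.0; 0.0 1.0]
Q, R = gram_schmidt_qr(A)
```

Q has orthonormal columns and QR reproduces A. In practice, modified Gram-Schmidt or Householder reflections are used instead because classical GS loses orthogonality in floating point.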
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The $QR$ factorization is very useful without being overly abstract. Ultimately, the insight that makes it possible is very intuitive (even though many symbols were harmed in the writing of this post). The method described above is by no means the only way to compute $QR$-factorization. I may go through some of the others in future posts.&lt;/p>
&lt;p>An algorithm like the one we considered in this post is one of the most satisfying things about working with and studying mathematics; one moment, you&amp;rsquo;re thinking about linear independence and orthogonality, and the next you&amp;rsquo;ve got a very useful, practical algorithm.&lt;/p>
&lt;p>(Note: The exposition of the algorithm in this post is inspired by that of &lt;a href="http://vmls-book.stanford.edu">this book&lt;/a> by Boyd and Vandenberghe.)&lt;/p></description></item><item><title>Fibonacci with linear algebra</title><link>https://www.jgindi.me/posts/2020-10-27-fib-lin-alg/</link><pubDate>Tue, 27 Oct 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-10-27-fib-lin-alg/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>After writing a post about one interesting way to arrive at a closed-form for the $n$th term of the Fibonacci
sequence, a friend pointed out a few alternative ways to get there, one of which felt particularly natural.
It requires some linear algebra, so I guess that in some sense, it could be considered &amp;ldquo;unnatural,&amp;rdquo; but I
especially like it because the argument&amp;rsquo;s flow requires fewer arbitrary-seeming leaps. With that said, this
post will be brief (as I&amp;rsquo;m a little busy at the moment), so if there&amp;rsquo;s anything that doesn&amp;rsquo;t make sense or is
incorrect, please do reach out and let me know.&lt;/p>
&lt;h2 id="fibonacci-matrix">Fibonacci matrix&lt;/h2>
&lt;p>The Fibonacci sequence can be viewed through the lens of matrices. In particular, if we start with the matrix
&lt;/p>
$$
A = \begin{bmatrix} 1 &amp; 1 \\ 1 &amp; 0 \end{bmatrix},
$$&lt;p> we can see that the Fibonacci sequence can be materialized by repeatedly multiplying $A$ by itself.&lt;/p>
&lt;p>To see this, first notice that if we are considering the Fibonacci sequence that starts with $F_0 = 0$ and
$F_1 = 1$, then we note that
&lt;/p>
$$
A = \begin{bmatrix} F_2 &amp; F_1 \\ F_1 &amp; F_0 \end{bmatrix}.
$$&lt;p>
Suppose now that
&lt;/p>
$$
A^{n-1} = \begin{bmatrix} 1 &amp; 1 \\ 1 &amp; 0 \end{bmatrix}^{n-1} = \begin{bmatrix} F_{n} &amp; F_{n-1} \\ F_{n-1} &amp;
F_{n-2} \end{bmatrix}
$$&lt;p> for some $n \geq 2$ (the case $n = 2$ is exactly the display above). Then
&lt;/p>
$$
A^n = A^{n-1}A = \begin{bmatrix} F_{n} + F_{n-1} &amp; F_{n} \\ F_{n-1} + F_{n-2} &amp; F_{n-1}
\end{bmatrix} = \begin{bmatrix} F_{n+1} &amp; F_{n} \\ F_{n} &amp; F_{n-1} \end{bmatrix}.
$$&lt;p>
Thus, to get the $n$th element of the Fibonacci sequence, we need only read off an off-diagonal entry of $A^n$ (equivalently, the upper left entry of $A^{n-1}$). If
we had a fast way to obtain $A^n$ instead of actually carrying out iterated matrix multiplication, we could
obtain the $n$th element without doing very much work.&lt;/p>
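&lt;p>As a quick sketch, note that with the convention $A^n = \begin{bmatrix} F_{n+1} &amp; F_n \\ F_n &amp; F_{n-1} \end{bmatrix}$, this is only a couple of lines of Julia:&lt;/p>

```julia
# Fibonacci from powers of the 2x2 matrix; BigInt avoids overflow for large n
A = BigInt[1 1; 1 0]
fib(n) = (A^n)[1, 2]
```

Julia computes the integer matrix power by repeated squaring, so fib(n) needs only on the order of log(n) matrix multiplications rather than n of them.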
&lt;h2 id="diagonalizing">Diagonalizing $A$&lt;/h2>
&lt;p>To do this, we will look to diagonalize $A$. This means that we will try to write
&lt;/p>
$$
A = VDV^{-1}
$$&lt;p> where $D$ is a diagonal matrix and $V$ is a matrix with eigenvectors of $A$ as columns (we have to
actually find $D$ and $V$ that work). What&amp;rsquo;s nice about a diagonal representation is that
&lt;/p>
$$
A^n = (VDV^{-1})^n = VDV^{-1}VDV^{-1}\dots VDV^{-1} = VD^nV^{-1}.
$$&lt;p>
If $D$ is written as a matrix with $\lambda_1,\dots,\lambda_n$ on the diagonal, then $D^n$ is simply $D$ with
the elements on the diagonal each taken to the $n$th power, like so:
&lt;/p>
$$
D = \begin{bmatrix} \lambda_1 &amp; 0 \\ 0 &amp; \lambda_2 \end{bmatrix} \implies D^n = \begin{bmatrix} \lambda_1^n &amp;
0 \\ 0 &amp; \lambda_2^n \end{bmatrix}.
$$&lt;p>Now that we&amp;rsquo;ve laid out our approach, it&amp;rsquo;s time to carry it out. (I will spare you the algebra, to keep this post brief.)&lt;/p>
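&lt;p>Before doing the algebra by hand, we can sanity-check the whole plan numerically with Julia's eigen (here $F_n$ appears in the off-diagonal entries of $A^n$):&lt;/p>

```julia
using LinearAlgebra

# diagonalize A numerically and check that A^n = V D^n V⁻¹
A = [1.0 1.0; 1.0 0.0]
d, V = eigen(A)                  # eigenvalues (1 ± √5)/2, in some order
n = 10
An = V * Diagonal(d .^ n) * inv(V)
```

An matches A^10 entrywise, and its off-diagonal entry is (up to floating point error) the 10th Fibonacci number, 55.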
&lt;h2 id="using-s-eigenvalues">Using $A$&amp;rsquo;s eigenvalues&lt;/h2>
&lt;p>The eigenvalues of $A$ (as you might expect if you read the earlier post about Fibonacci) are
$\lambda_1,\lambda_2 = \frac{1 \pm \sqrt 5}{2}$, and the corresponding eigenvectors are $v_1, v_2 =
\begin{bmatrix} \frac{1 \pm \sqrt 5}{2} \\ 1 \end{bmatrix}$. Thus, we have
&lt;/p>
$$
D = \begin{bmatrix} \frac{1 + \sqrt 5}{2} &amp; 0 \\ 0 &amp; \frac{1 - \sqrt 5}{2} \end{bmatrix} ~~~~ V =
\begin{bmatrix} \frac{1 + \sqrt 5}{2} &amp; \frac{1 - \sqrt 5}{2} \\ 1 &amp; 1\end{bmatrix}.
$$&lt;p>
With these in hand, we see that $A$ is indeed diagonalizable, so that $A^n$ can be written as $VD^nV^{-1}$,
and voila! With two matrix multiplications and exponentiating the diagonal entries of $D$, we can
quickly and efficiently come up with large Fibonacci numbers.&lt;/p></description></item><item><title>Fibonacci with difference equations</title><link>https://www.jgindi.me/posts/2019-10-13-fib-diff-eq/</link><pubDate>Tue, 13 Oct 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2019-10-13-fib-diff-eq/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The Fibonacci sequence is a mathematical object that has garnered much fascination over the centuries due to its simplicity and knack for rearing its head in all sorts of unexpected places. In this post I want to lead you to a different way of thinking about the sequence that is wholly unintuitive. You can only really arrive at it by using algebra in conjunction with some lucky guesses, as we’ll see. But before we dive into all that, let’s start with a bit of background.&lt;/p>
&lt;p>At the beginning of the 13th century, Fibonacci first wrote about the eponymous sequence while pondering a sensible way to model the evolution of a rabbit population over time. As we will discuss below, the definition of the sequence is delightfully simple, yet it seems to show up in a bunch of different places, including:&lt;/p>
&lt;ul>
&lt;li>The way that branches of trees propagate in nature.&lt;/li>
&lt;li>Numbers of flower petals on certain kinds of flowers.&lt;/li>
&lt;li>The number of possible ancestors on a human $X$-chromosome.&lt;/li>
&lt;li>Any place you&amp;rsquo;ve heard about the golden ratio $\Phi$.&lt;/li>
&lt;/ul>
&lt;p>It is an object that is a delicious blend of simple and deep. So simple to play around with, in fact, that the reason I thought to write this post was actually the result of an interview that took an interesting (and fun!) turn.&lt;/p>
&lt;p>One of my colleagues asked a candidate with some technical expertise to write some code that would do something with Fibonacci numbers. In response, the candidate cited this crazy-looking formula to my colleague without justification. After the interview, my colleague came back to our bay of desks interested as to whether I had ever seen the formula. When I said that I had, he asked if I knew how to prove that it was correct. While I didn&amp;rsquo;t know how to on the spot, I vaguely recalled some concepts from a class I took during undergrad that I thought might help. Together with some googling, I put together what we will work through below. Looking at the work I had done once I finished, I decided that what we’re about to embark on is a cool example of the type of visibility math can give us into things we might not be able to see otherwise. Let’s go!&lt;/p>
&lt;h2 id="mathematical-background">Mathematical background&lt;/h2>
&lt;p>A sequence is just a list of finite, or even infinite, length. As examples, $1, 2, 3, 4, 5$ is a sequence, as are $2, 4, 6, 8, 10, \dots$ (the even numbers) and $2, 3, 5, 7, 11, \dots$ (the prime numbers). Each of these sequences has a first element, a second element, a third element, etc., so when we want to prove general assertions about sequences without explicitly writing down their elements, we denote the elements by $a_1$ (the first element), $a_2$ (the second element), $a_3$ (the third element), …. For the sequence $1, 2, 3, 4, 5$, we would say that $a_1 = 1$, $a_2 = 2$ and so on. Another important characteristic that usually comes along with sequences is some kind of rule that tells you how to come up with a given element. That is, many sequences come with a formula, in terms of $i$, that can produce the value of $a_i$. In the case of the even numbers, that formula is $a_i = 2i$ (to compute the $i$th element, calculate $2 \times i$). To get the 3rd even number, compute $2 \times 3 = 6$. To get the 37th, compute $2 \times 37 = 74$.&lt;/p>
&lt;p>The Fibonacci sequence is a recursive sequence. This means that it has a special kind of rule whereby the next term is defined in terms of the previous terms. The rule that defines the Fibonacci sequence is:
&lt;/p>
$$F_n = F_{n-1} + F_{n-2}.$$&lt;p>
In English, this says that the $n$th element of the Fibonacci sequence is the sum of the two previous elements; the 100th element is the sum of the 99th (which is the sum of the 98th and 97th, and so on) and the 98th (which is the sum of the 97th and 96th, etc.). Each of the two previous elements is the sum of the two elements before them, and so on and so forth. But, you might say, this can’t go on forever; there has to be some bottom at which the recursion ends, right? There is indeed! In order to define a recursive sequence, you need the first few elements (usually one or two) and a way of using previous elements to create new ones. In our case, we will set $F_0 = 0$ and $F_1 = 1$.&lt;/p>
&lt;p>You’ll notice that in order to calculate the $n$th element of this sequence, you have to do a good deal more work than you do to calculate the $n$th even number. To get the $n$th Fibonacci number, you have to traverse the chain all the way back to the beginning for each element you want to calculate, while the even numbers have what is known as a closed form, i.e. the aforementioned formula in terms of $n$ that you can just plug $n$ into to get the desired element. The question we’ll tackle for the rest of the post is: Does the Fibonacci sequence have a closed form? In other words, is there a formula, in terms of $n$, that I can just plug $n$ into to get the $n$th element of the Fibonacci sequence?&lt;/p>
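&lt;p>To make the difference in effort concrete, here is the recursive rule translated directly into Python (a minimal sketch of my own; the function name is made up):&lt;/p>

```python
def fib_recursive(n):
    # Base cases: F_0 = 0 and F_1 = 1
    if n == 0:
        return 0
    if n == 1:
        return 1
    # The recursive rule: F_n = F_{n-1} + F_{n-2}. Note that computing
    # F_n this way recomputes the same earlier values over and over.
    return fib_recursive(n - 1) + fib_recursive(n - 2)

print(fib_recursive(10))  # 55
```

&lt;p>Each call spawns two more calls, so the work grows rapidly with $n$, while computing the $n$th even number takes a single multiplication no matter how large $n$ is.&lt;/p>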
&lt;h2 id="using-difference-equations">Using difference equations&lt;/h2>
&lt;p>First, let’s restate the problem. We want to find some formula for $F_n$ that satisfies $F_n = F_{n-1} + F_{n-2}$, or $F_n - F_{n-1} - F_{n-2} = 0$. Our first quantum leap is to guess that $F_n = m^n$ for some real number $m$ that we’re going to calculate. By guessing that $m^n$ is a solution, we are saying that $m^n - m^{n-1} - m^{n-2} = 0$. It’s easy to see that $m = 0$ satisfies this equation, but that’s boring, so let’s look for an $m \neq 0$.&lt;/p>
&lt;p>Because we are now considering only nonzero values of $m$, we can divide $m^n - m^{n-1} - m^{n-2} = 0$ by $m^{n-2}$ on both sides so that we are now looking at values of $m$ that satisfy
&lt;/p>
$$m^2 - m - 1 = 0$$&lt;p>
(this is called the characteristic equation).&lt;/p>
&lt;p>Using the quadratic formula you probably learned about sometime during middle school, we find two values of $m$ that work: the first is $\frac{1 + \sqrt{5}}{2}$ and the second is $\frac{1 - \sqrt{5}}{2}$. We will refer to these two values as $m_1$ and $m_2$, respectively. Recall that quadratic equations don’t always have real roots &amp;ndash; sometimes they’re complex (i.e. they&amp;rsquo;re of the form $a + bi$ where $a,b$ are real numbers and $i$ is $\sqrt{-1}$)! For our second and last quantum leap, we’re going to take for granted the fact that when you’re solving what’s called a (deep breath now) second order homogeneous difference equation with constant coefficients (which $F_n - F_{n-1} - F_{n-2} = 0$ is), if the characteristic equation has distinct real roots (in our case, if $m_1$ and $m_2$ are real and different from one another) then the solution to our difference equation (our formula for $F_n$) has the form
&lt;/p>
$$
F_n = Am_1^n + Bm_2^n
$$&lt;p>
where $A$ and $B$ are constants. We already found $m_1$ and $m_2$, so we just need to find $A$ and $B$. For this, we use the fact that we start the Fibonacci sequence at 0 and then 1, i.e. $F_0 = 0$ and $F_1 = 1$. Let’s use these to finish the problem. Using $F_0 = 0$, we have
&lt;/p>
$$
F_0 = 0 = Am_1^0 + Bm_2^0 = A + B
$$&lt;p>
so that $B = -A$. Then we can use $F_1 = 1$ to see that
&lt;/p>
$$
F_1 = 1 = Am_1^1 + Bm_2^1 = A \frac{1 + \sqrt{5}}{2} + B \frac{1 - \sqrt{5}}{2}.
$$&lt;p>
Because $B = -A$, this is the same as
&lt;/p>
$$
1 = A \frac{1 + \sqrt{5}}{2} - A \frac{1 - \sqrt{5}}{2} = A\biggl(\frac{1 + \sqrt{5}}{2} - \frac{1 - \sqrt{5}}{2}\biggr) = A\sqrt{5}
$$&lt;p>
so $A = \frac{1}{\sqrt{5}}$ and $B = -A = -\frac{1}{\sqrt{5}}$. Thus,
&lt;/p>
$$
F_n = \frac{1}{\sqrt{5}}\biggl(\frac{1 + \sqrt{5}}{2}\biggr)^n - \frac{1}{\sqrt{5}}\biggl(\frac{1 - \sqrt{5}}{2}\biggr)^n.
$$&lt;p>
Pulling out the $\frac{1}{\sqrt{5}}$, we have
&lt;/p>
$$
F_n = \frac{1}{\sqrt{5}}\Biggl(\biggl(\frac{1 + \sqrt{5}}{2}\biggr)^n - \biggl(\frac{1 - \sqrt{5}}{2}\biggr)^n\Biggr).
$$&lt;h2 id="implementation">Implementation&lt;/h2>
&lt;p>Just to drive home why I think this is cool: we started out with a sequence defined by adding integers to each other in a pretty simple way; then, using some techniques that feel kind of heavy and opaque, we waved a magic wand, $\sqrt{5}$ showed up, and we produced a totally alien and unfamiliar, yet correct, representation of the elements of the Fibonacci sequence. To check our work, I wrote a short computer program that computes the first 10 Fibonacci numbers:&lt;/p>
&lt;p>(Note: the lines that start with # are not actually code, they are comments to help guide the reader.)&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">nth_fibonacci_number&lt;/span>(n):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Given a number n as input, this function computes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># the nth Fibonacci number&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Reflects our assertion that F_0 = 0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> n &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Reflects our assertion that F_1 = 1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> n &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># For any number n greater than or equal to 2, use the&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># formula we came up with. The symbol * is multiplication,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># / is division, ** is exponentiation and math.sqrt takes a&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># square root. The below is one line, the &amp;#34;\&amp;#34; character is&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># just a python technicality.&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> nth_fib &lt;span style="color:#f92672">=&lt;/span> (&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">/&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(&lt;span style="color:#ae81ff">5&lt;/span>)) &lt;span style="color:#f92672">*&lt;/span> ( ((&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">+&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(&lt;span style="color:#ae81ff">5&lt;/span>)) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#f92672">**&lt;/span> n \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">-&lt;/span> ((&lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#f92672">-&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(&lt;span style="color:#ae81ff">5&lt;/span>)) &lt;span style="color:#f92672">/&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#f92672">**&lt;/span> n)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Return the result&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> nth_fib
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When I ran this code with $n = 0, 1, 2, 3, …, 9$, I got&lt;/p>
$$
\begin{align*}
F_0 &amp;= 0\\
F_1 &amp;= 1\\
F_2 &amp;= 1\\
F_3 &amp;= 2\\
F_4 &amp;= 3\\
F_5 &amp;= 5\\
F_6 &amp;= 8\\
F_7 &amp;= 13\\
F_8 &amp;= 21\\
F_9 &amp;= 34
\end{align*}
$$&lt;p>which is what you&amp;rsquo;d get if you went and calculated the first 10 Fibonacci numbers the usual way.&lt;/p>
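&lt;p>As one more sanity check (my own addition), we can compare the closed form against the ordinary add-the-previous-two computation for a whole range of $n$ at once. The closed form is evaluated in floating-point arithmetic, so we round its output to the nearest integer before comparing:&lt;/p>

```python
import math

def fib_closed(n):
    # The closed form we derived above
    return (1 / math.sqrt(5)) * (((1 + math.sqrt(5)) / 2) ** n
                                 - ((1 - math.sqrt(5)) / 2) ** n)

def fib_iterative(n):
    # The usual way: start from F_0 = 0, F_1 = 1 and add forward
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

for n in range(30):
    assert round(fib_closed(n)) == fib_iterative(n)
print("closed form matches the recursion for n = 0, ..., 29")
```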
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In my opinion, what we did here is far less important than how we did it. We started with an object that appeared simple. We then pulled techniques out of the mathematical netherworld to twist it into something entirely unrecognizable &amp;ndash; even, dare I say, scary. Yet somehow, once we were done with the trickery and misdirection, we found something in our hand that… well… just worked. Why did it work? Why should it even exist? Because someone curious enough reached into the abstract and found it. It’s as simple as that.&lt;/p></description></item><item><title>An introduction to convex optimization</title><link>https://www.jgindi.me/posts/2020-09-04-cvx-opt/</link><pubDate>Fri, 04 Sep 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-09-04-cvx-opt/</guid><description>&lt;h2 id="the-general-idea">The General Idea&lt;/h2>
&lt;p>Over this past summer (2020), I took Stanford&amp;rsquo;s &lt;a href="https://web.stanford.edu/class/ee364a/">EE364A&lt;/a> course, which is about a subdiscipline of mathematical optimization called convex optimization. I learned that it has myriad applications all over engineering, finance, medicine, signal processing, and many other seemingly disconnected fields. In this post, I want to discuss what convex optimization is and what makes it so useful as a problem solving technique.&lt;/p>
&lt;p>In as general a sense as possible, we humans solve optimization problems all the time. We find ourselves in situations where we have to make decisions about which choice will lead to the best outcome (or the least bad outcome in some unfortunate cases), but most of the time, the space of possible decisions is constrained. When we find ourselves saying things like &amp;ldquo;If the stars aligned, I would&amp;hellip;,&amp;rdquo; or, &amp;ldquo;In a perfect world, I&amp;rsquo;d choose&amp;hellip;,&amp;rdquo; we&amp;rsquo;re usually lamenting the fact that the decision we&amp;rsquo;re facing would be easier to make if our options were not constrained.&lt;/p>
&lt;p>The field of mathematical optimization attempts to bring mathematical formalism to the above-described decision making process. A mathematical optimization problem has a few components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Decision variable:&lt;/strong> This represents the space of decisions you have to make. Solving an optimization problem involves finding the &amp;ldquo;best&amp;rdquo; value of a decision variable, where best is defined by your choice of&amp;hellip;&lt;/li>
&lt;li>&lt;strong>Cost/Profit (a.k.a. objective) function:&lt;/strong> This represents some negative/positive utility of a decision.&lt;/li>
&lt;li>&lt;strong>Constraints:&lt;/strong> These constrain the space of decisions you can make. The constraints imply a &lt;em>feasible&lt;/em> set, which is a set of allowed decisions.&lt;/li>
&lt;/ul>
&lt;p>Solving an optimization problem with these components amounts to finding a feasible value of the decision variable that minimizes (or maximizes) the cost (or profit) function. We will see examples of this below.&lt;/p>
&lt;p>While mathematicians have managed to develop a deep, rich mathematical theory around how to reason about and solve different kinds of mathematical optimization problems, unfortunately, for many (most) optimization problems, finding optimal solutions is computationally very difficult. For problems of significant enough size, this effectively renders them intractable.&lt;/p>
&lt;p>There is, however, a certain interesting class of optimization problems that can be solved very efficiently: these are called convex optimization problems. What makes a convex optimization problem easier to work with is the &amp;ldquo;shape&amp;rdquo; of its objective function and its feasible set. In order for an optimization problem to be convex, the objective function and feasible set must be convex.&lt;/p>
&lt;h2 id="convex-sets-and-convex-functions">Convex sets and convex functions&lt;/h2>
&lt;p>In an effort to not use any math symbols in this post, I&amp;rsquo;m going to resort to pictures to describe what convexity means and looks like for sets and functions. A set is convex if anytime you pick two points in a set, the line between those points is also inside the set. A picture from Chapter 2 of Convex Optimization by Boyd and Vandenberghe provides the idea:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/cvx-opt/convex.png" alt="">&lt;/p>
&lt;p>In the left set (all the points on the inside of the hexagon), pick any two points you want. Now draw a line segment between them. Notice that no matter what pair of points you pick, all points along the line segment between them will lie inside the hexagon! In the right set, on the other hand, we see a pair of points (there are actually many; can you find another pair?) for which the line segment between them is not fully contained within the set. The set on the right is thus not convex.&lt;/p>
&lt;p>Loosely speaking, a function is convex if it has upward curvature. The simplest example that gets the point across is to think of a smile. In three dimensions, a simple example is a bowl. In more mathematical parlance, upward curvature means that any tangent line (or, in higher dimensions, plane) that you draw through any point on the function lies below the function itself. In the picture below, the red and green lines are tangent to the black curve. Notice that the red, green, and any other tangent line you might draw lie or would lie beneath the black curve.&lt;/p>
&lt;p>That&amp;rsquo;s it! Those properties and some clever reasoning are enough to ensure that if a problem is convex, you can probably solve it efficiently with the right software. As long as the space of allowed decisions is a convex set and the objective function is convex, you&amp;rsquo;re off to the races!&lt;/p>
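&lt;p>To give a tiny, concrete taste of what &amp;ldquo;solve it efficiently&amp;rdquo; can look like, here is a toy sketch of my own (not from the course): we minimize a bowl-shaped function over an interval by repeatedly stepping downhill and clipping back into the allowed range.&lt;/p>

```python
def solve_toy_convex_problem():
    # Toy example: minimize the bowl-shaped function (x - 3)^2 while only
    # allowing x between 0 and 2. The bowl's lowest point (x = 3) lies
    # outside the allowed range, so the best allowed choice is x = 2.
    x = 0.0
    step = 0.1
    for _ in range(100):
        slope = 2 * (x - 3)        # direction of steepest ascent
        x = x - step * slope       # take a small step downhill
        x = min(max(x, 0.0), 2.0)  # clip back into the allowed range
    return x

print(solve_toy_convex_problem())  # converges to 2.0
```

&lt;p>Because the objective is a bowl and the allowed range is a convex set, this downhill-and-clip procedure cannot get stuck at a wrong answer; that guarantee is exactly what breaks down for non-convex problems.&lt;/p>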
&lt;h2 id="examples">Examples&lt;/h2>
&lt;p>Before closing this post, I want to describe two examples of how you might translate an industry problem into a convex optimization problem.&lt;/p>
&lt;p>The first has to do with radiation treatment planning for cancer patients. The goal is to figure out how to schedule a radiation treatment plan (over some period of time) that trades off damage to a patient&amp;rsquo;s health with shrinking the size of a tumor. For this problem, we set it up as follows:&lt;/p>
&lt;ul>
&lt;li>The decision we have to make is how much treatment to administer in each time period.&lt;/li>
&lt;li>The goal is to minimize the maximum damage to the patient over the entire course of the treatment.&lt;/li>
&lt;li>We are constrained by:
&lt;ul>
&lt;li>A maximum allowed dose in each period.&lt;/li>
&lt;li>Wanting the tumor to be below some maximum tumor size.&lt;/li>
&lt;li>The way the patient&amp;rsquo;s health changes with treatment over time.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>With a few mathematical tricks, this problem can be formulated as a convex problem and solved very efficiently. As you might imagine, balancing patient health and tumor size is obviously critical for the health and longevity of patients. I think this is a great example of a case in which convex optimization allows us to thread that needle very precisely. (Note: This problem was actually a problem on the final exam in the course I took. It was derived from real research by the professor.)&lt;/p>
&lt;p>Another example of the effectiveness of convex optimization comes from a very different application: finance. In this example, we want to choose a portfolio that minimizes risk while achieving a particular expected return. In its simplest form, this is a convex optimization problem that can be formulated as follows:&lt;/p>
&lt;ul>
&lt;li>The decision we have to make is to pick a portfolio (from a universe of a bunch of stocks).&lt;/li>
&lt;li>The cost of the portfolio is the amount of risk the portfolio holds. We want to minimize this quantity.&lt;/li>
&lt;li>We are constrained by the expected return that we want to achieve. Any portfolio we pick has to achieve a particular expected return.&lt;/li>
&lt;/ul>
&lt;p>We can add other constraints (e.g., no short-selling) and add terms to the cost function (e.g., tax liability and transaction costs, which we would also want to minimize), but the optimization problem we just formulated, in some form or other, is at the core of most of the portfolio construction being done in industry today.&lt;/p>
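&lt;p>For the curious, here is a deliberately tiny sketch of the portfolio problem in code. Everything here is made up for illustration: three independent assets, invented expected returns and risks, and a plain grid search standing in for the specialized solvers that handle real problems:&lt;/p>

```python
def min_risk_portfolio():
    # Hypothetical setup: three independent assets with made-up expected
    # returns of 5%, 10%, and 15%, the variances (risk) below, and a
    # target expected return of 10%.
    variances = [0.01, 0.04, 0.09]

    # For these numbers, the constraints "weights sum to 1" and "expected
    # return equals 10%" force the weights into the form [a, 1 - 2a, a].
    # Portfolio risk as a function of a is then a convex parabola, so a
    # simple scan over a finds the minimum-risk feasible portfolio.
    def risk(a):
        weights = [a, 1 - 2 * a, a]
        return sum(w * w * v for w, v in zip(weights, variances))

    best_a = min((i / 1000 for i in range(1001)), key=risk)
    return [best_a, 1 - 2 * best_a, best_a]

weights = min_risk_portfolio()
print(weights)  # roughly [0.31, 0.38, 0.31]
```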
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Examples of convex optimization&amp;rsquo;s uncanny effectiveness and ubiquity are everywhere, but there&amp;rsquo;s an important point I want to stress before we close. In each of the examples above, the accuracy and utility of the output depends on very human choices about the problem setup. In the treatment planning example, it depends on the models the mathematicians and doctors come up with for how patient health changes and how tumor size changes. In the portfolio construction example, it depends on how good our projections of expected returns are. Some of what I showed and talked about in this post is heavily mathematical, but modeling these problems well is truly an art. So while the math is important and the engine that makes all of this possible, none of it really works without consistent communication and collaboration with domain experts.&lt;/p>
&lt;p>Finally, I want to thank &lt;a href="https://web.stanford.edu/~boyd/">Stephen Boyd&lt;/a>, &lt;a href="https://web.stanford.edu/~anqif/">Anqi Fu&lt;/a>, and the rest of the EE364A staff for a fantastic course. I really cannot overstate how excellent the class was. If you&amp;rsquo;re interested in seeing what some of this is about in more detail, a free version of the class is available &lt;a href="https://www.youtube.com/playlist?list=PL3940DD956CDF0622">on YouTube&lt;/a> and the lecture slides and course textbook are available for free &lt;a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">here&lt;/a>.&lt;/p>
&lt;p>I did it! No math symbols!&lt;/p></description></item><item><title>How many infinities are there?</title><link>https://www.jgindi.me/posts/2020-04-05-cantors-thm/</link><pubDate>Sun, 05 Apr 2020 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2020-04-05-cantors-thm/</guid><description>&lt;p>(This post assumes you&amp;rsquo;ve read, at least, &lt;a href="https://www.quora.com/q/iqjtzyaumcdpirqz/Big-and-bigger-Part-1-one-to-one-correspondences">this&lt;/a> and &lt;a href="https://www.blogger.com/blog/post/edit/754026028311034636/3224454722535965349#">this&lt;/a>.)&lt;/p>
&lt;p>All of the posts on infinity that I&amp;rsquo;ve written to this point have pointed to two different infinite sizes, or cardinalities. The first is the countable kind, the kind we associate with the natural numbers $\mathbf{N}$. The other is the kind we associate with the real numbers, $\mathbf{R}$; we call $\mathbf{R}$ an uncountable set.&lt;/p>
&lt;p>Before we proved that $\mathbf{R}$ has a different size than $\mathbf{N}$, we made a convincing intuitive
case that there is really only one kind of infinity. That infinities come in at least two varieties was one
of the many counterintuitive results with which Cantor laid the foundation of rigorous transfinite
mathematics. But there is yet another question we have still not answered: Are there more than two
transfinite cardinalities, or does every infinite set have either $|\mathbf{N}|$ or $|\mathbf{R}|$ elements?&lt;/p>
&lt;p>This post concerns a result that Cantor proved in the same 1891 paper in which his diagonalization argument for the uncountability of the reals appears. It is called Cantor&amp;rsquo;s Theorem, and it shows that there are actually infinitely many infinities. What does that mean? How is that possible? Let&amp;rsquo;s dive in and see.&lt;/p>
&lt;p>Before we get into it, we establish a small amount of notation. If $a$ is an element of $A$, we write $a \in
A$. The size of the set $A$, or $A$&amp;rsquo;s cardinality, is denoted $|A|$. The power set of a set $A$, denoted
$P(A)$, is the set of all subsets of $A$. For example, the power set of $A = \{1, 2\}$ is $P(A) =
\{\emptyset,\{1\}, \{2\}, \{1, 2\}\}$, where $\emptyset$ is the empty set. There is actually a
straightforward argument that shows that if $|A| = n$, then $|P(A)| = 2^n$.
It goes like this. Each element $a$ in $A$ is either in or not in each subset of $A$. Thus, constructing a
subset requires $n$ choices, each of which is between two options (&amp;ldquo;in&amp;rdquo; or &amp;ldquo;not in&amp;rdquo;); the number of such
subsets is thus $2 \times \dots \times 2 = 2^n$.&lt;/p>
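&lt;p>For small finite sets, this count is easy to verify by brute force (a quick Python sketch of my own):&lt;/p>

```python
from itertools import combinations

def power_set(elements):
    # Build every subset by choosing, for each size r, all ways of
    # picking r of the elements (r = 0 gives the empty set).
    return [set(c) for r in range(len(elements) + 1)
            for c in combinations(elements, r)]

subsets = power_set([1, 2])
print(subsets)                        # the 4 subsets of {1, 2}
print(len(power_set([1, 2, 3, 4])))  # 16, i.e. 2^4
```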
&lt;p>Cantor&amp;rsquo;s Theorem is quite compact. It says that $|P(A)| > |A|$; in English, the number of subsets of $A$ is
strictly greater than the size of $A$. But based on what we just showed, shouldn&amp;rsquo;t this be obvious? For any
$n \geq 0$, $2^n > n$, after all. What is so revolutionary or helpful about Cantor&amp;rsquo;s Theorem? As is often the
case with theorems in set theory, the subtlety stems from having to extend the result to infinite sets.&lt;/p>
&lt;p>The beauty of Cantor&amp;rsquo;s Theorem is that with one elegant argument, Cantor proves the above for any set,
finite, countably infinite, or uncountably infinite. The way he does it is by using a proof technique called
proof by contradiction. The technique consists of the following steps:&lt;/p>
&lt;ol>
&lt;li>Assume the opposite of what you want to show.&lt;/li>
&lt;li>Show that the assumption from (1) leads to some absurdity.&lt;/li>
&lt;li>Conclude that the assumption that led to the absurdity must be false, so its opposite (your original claim) must be true.&lt;/li>
&lt;/ol>
&lt;p>The proof proceeds as follows. First, we assume that $|A| = |P(A)|$. In order to handle infinite sets, this
means that we are assuming that each element $a \in A$ can be matched with exactly one element $S_a$ of
$P(A)$ (a subset of $A$) with none left over. For each element-subset pair, $(a, S_a)$, either $a \in S_a$ or
$a \notin S_a$. Now consider the set of $b \in A$ for which $b \notin S_b$. That is, consider the set of all
elements that are not contained in the sets they map to; because this group of elements is a subset of $A$,
we can refer to it as $S_c$ for some $c \in A$. We now have to answer a simple question: Is $c \in S_c$?&lt;/p>
&lt;p>If $c \in S_c$, then $c$ is an element that is not in the set it maps to, namely $S_c$&amp;hellip; which is absurd.
But then surely, $c \notin S_c$&amp;hellip; right? If $c \notin S_c$, though, it means that $c$ is an element that is
in the set it maps to, namely $S_c$. But if it is in $S_c$, we get into the same pickle we were in in the
first case. Thus, $c$ cannot be in $S_c$ and $c$ cannot not be in $S_c$. We have thus reached our desired
contradiction! Our assumption, that $|A| = |P(A)|$, must be false.&lt;/p>
&lt;p>We&amp;rsquo;re not quite done, though. We&amp;rsquo;ve shown that $A$ and $P(A)$ are not the same size, but which is bigger?
Well, the diagonal set we built is a member of $P(A)$ that no element of $A$ can be matched with, so we must
have $|P(A)| > |A|$, proving the theorem.&lt;/p>
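&lt;p>For a small finite set, we can even check the heart of the argument exhaustively by computer (a brute-force sketch of my own): for every possible matching of elements to subsets, the &amp;ldquo;diagonal&amp;rdquo; set of elements not contained in their matched subsets is never itself one of the matched subsets.&lt;/p>

```python
from itertools import combinations, product

A = [0, 1, 2]
# All 2^3 = 8 subsets of A
subsets = [frozenset(c) for r in range(len(A) + 1)
           for c in combinations(A, r)]

# Try every possible way of matching each element of A with a subset
# (8^3 = 512 candidate matchings) and verify that the diagonal set
# S = {b in A : b not in S_b} is never hit by the matching.
for images in product(subsets, repeat=len(A)):
    matching = dict(zip(A, images))
    diagonal = frozenset(b for b in A if b not in matching[b])
    assert diagonal not in images

print("no matching of A with P(A) ever covers the diagonal set")
```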
&lt;p>Cantor&amp;rsquo;s Theorem answers this post&amp;rsquo;s original question. Can you see how? It says that if you take any
infinite set $A$, its power set $P(A)$ furnishes a larger infinity. But then we get an even larger infinity
when we form $P(P(A))$, and a yet larger one when we consider $P(P(P(A)))$.&lt;/p>
&lt;p>If two infinities weren&amp;rsquo;t enough, now you&amp;rsquo;ve got as many as you like.&lt;/p></description></item><item><title>The weak law of large numbers</title><link>https://www.jgindi.me/posts/2018-12-31-lln/</link><pubDate>Mon, 31 Dec 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-12-31-lln/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When I say that Alice is better than Bob at chess, I’m effectively asserting that barring careless mistakes,
Alice should always beat Bob. This notion, of Alice “being better than” Bob is easy to wrap one’s head around
when the game in question doesn’t have to contend with randomness or uncertainty. What does it mean, though,
to say that Alice is better than Bob at a game like backgammon, where dice (a source of randomness) are involved? The rest of this post aims to provide some mathematical machinery to answer such a question.&lt;/p>
&lt;p>Whenever one talks about expecting some “eventual” outcome of a game or experiment, he or she is actually
invoking a fundamental statistical law: the Law of Large Numbers (henceforth LLN). The LLN is one of a group
of fundamental results in probability theory known as limit theorems, which are statements about how certain
aspects of random variables stabilize as you make more and more observations of them. In plain English (we’ll
get to its technical formulation a bit later), the LLN (and we’ll see that there are actually two
variants) says that if you have an experiment whose average outcome is a number $m$, then as you try the
experiment more and more times, the average value of your collection of outcomes will tend to $m$.&lt;/p>
&lt;h2 id="examples">Examples&lt;/h2>
&lt;p>To get a feel for what’s going on here, let’s look at a few examples that demonstrate the LLN’s ubiquity and
importance.&lt;/p>
&lt;ul>
&lt;li>When you play blackjack against the house, its ability to make money hinges on a critical assumption:
(without cheating,) even using a probabilistically optimal strategy, the chances that you win a hand are
less than 50%. The house edge might be very, very small, but as long as it has some nonzero edge, the house
stands to make money in the long run because it is playing lots and lots of hands. This assumption relies
on the LLN.&lt;/li>
&lt;li>Let’s imagine that a basketball player is in a shooting slump. He regularly shoots around 40% from
outside the three point line, but of late he’s only managed to connect on 20% of his attempts.
Encouragement offered to such a player typically looks like “Don’t worry, just keep shooting, your shot
will come back!” A coach who lifts his player’s spirits this way is also relying on the LLN.&lt;/li>
&lt;li>To return to our backgammon example, when we say that Alice is better than Bob, what we’re saying is
that in any individual game, it’s possible that Bob beats Alice, but if they were to play 100 or
1000 or 1000000 games against each other, Alice would end up winning a majority of them. The more games they
play, the more obvious the advantage would become.&lt;/li>
&lt;/ul>
&lt;h2 id="formal-statement">Formal statement&lt;/h2>
&lt;p>As I’ve noted in many other posts, it is one thing to have a good idea, another to formalize it
mathematically. The concept underlying the LLN is one that most of us intuitively grasp without understanding
statistics. But in order to prove it and then use it as a building block to understand other, more subtle
results, we need to be able to state it formally.&lt;/p>
&lt;p>Before we do, I want to note that there are actually two well known versions of the LLN. We will concern
ourselves here with the Weak Law of Large Numbers (the Strong Law is harder to prove but says something
similar in spirit). Mathematically, we would state it as follows (don’t worry &amp;ndash; we’ll break each piece of
this down momentarily): Let $X_1, \dots, X_n$ be independent and identically distributed random
variables all having finite average value $m$. Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any
$\varepsilon > 0$, $P(|\bar X_n - m| &lt; \varepsilon) \to 1$ as $n \to \infty$.&lt;/p>
&lt;p>Let’s dissect this piece by piece and make sure it makes sense:&lt;/p>
&lt;ul>
&lt;li>“Let $X_1, \dots, X_n$ be independent and identically distributed random variables”: This means that each
observation $X_i$ is independent of all the others and is produced by the same distribution as all the
others. Think of a black box that independently spits out a sequence of $n$ numbers using some fixed, unknown
probability distribution. Each number the box spits out would be represented by an $X_i$. (The theorem is
going to say what we can infer about the black box distribution’s average value once we’ve
made a sufficiently large number of observations.)&lt;/li>
&lt;li>“Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$”: $\bar X_n$ is the average value of the observations.&lt;/li>
&lt;li>“Then for any $\varepsilon > 0$”: This is a math trick students usually first come across in real
analysis. You basically pick some arbitrarily small value of $\varepsilon$ and then show that some quantity
can always be made smaller than the value you chose. (In our case, that value will be the difference
between the average value of the observations and the true average $m$.)&lt;/li>
&lt;li>“$P(|\bar X_n - m| &lt; \varepsilon) \to 1$ as $n \to \infty$”: The probability that the average value of
the observations and the true mean of the underlying distribution differ by less than $\varepsilon$ becomes
more and more certain as you take more and more observations.&lt;/li>
&lt;/ul>
&lt;p>Read that again if it didn’t make sense. Once all that sinks in, go back and read the formulation; I hope
you find that it’s a delightfully compact way of formalizing the intuition we started with.
Our final act will be to prove it. If you haven’t heard of Chebyshev’s inequality, read
&lt;a href="https://www.quora.com/q/iqjtzyaumcdpirqz/Probabilistic-Ballparking">this&lt;/a> before attempting the proof.&lt;/p>
&lt;h2 id="proof">Proof&lt;/h2>
&lt;p>We will assume for this proof that the $X_i$ have finite variance $\sigma^2$, though this
assumption is not necessary. (It’s just that if we don’t make that assumption, the proof gets more
complicated.)
First, let’s compute the mean and variance of $\bar X_n$. Because expectation is linear, and $\bar X_n$ is
just an average of RVs each having mean $m$, we have&lt;/p>
$$
\begin{align*}
E(\bar X_n) &amp;= \frac{1}{n}\sum_{i=1}^n E(X_i) \\\\
&amp;= \frac{1}{n}\sum_{i=1}^n m \\\\
&amp;= mn/n \\\\
&amp;= m.
\end{align*}
$$&lt;p>To compute the variance, we use the independence of the $X_i$ to write that&lt;/p>
$$
\begin{align*}
\text{Var}(\bar X_n) &amp;= \text{Var}(\frac{1}{n}\sum X_i) \\\\
&amp;= \frac{1}{n^2}\text{Var}(\sum X_i) \\\\
&amp;= \frac{1}{n^2}\sum \text{Var}(X_i) \\\\
&amp;= \frac{1}{n^2}n\sigma^2 \\\\
&amp;= \sigma^2/n.
\end{align*}
$$&lt;p>Before we continue, note that the fact that the variance is a decreasing function of $n$ makes sense.
The more observations I take, the smaller the variance of my average should intuitively be.&lt;/p>
&lt;p>Next, we use Chebyshev’s inequality with $\mu = m$, $k = \varepsilon$ and $X = \bar X_n$ to say that&lt;/p>
$$
\begin{align*}
P(|\bar X_n - m| &lt; \varepsilon) &amp;= 1 - P(|\bar X_n - m| \geq \varepsilon) \\\\
&amp;\geq 1 - \frac{\sigma^2/n}{\varepsilon^2} \\\\
&amp;= 1 -\frac{\sigma^2}{n\varepsilon^2}.
\end{align*}
$$&lt;p>(The inequality sign is flipped because we have that $1 - \dots$ in there. Otherwise, it’s just plug and play Chebyshev.)&lt;/p>
&lt;p>As $n$ gets larger and larger, that rightmost term tends to 0, so the lower bound tends to 1. Since a probability can never exceed 1, the probability of interest must itself tend to 1, and voila! We’re done.&lt;/p>
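&lt;p>To see the weak law in action, here is a small simulation of my own (rolling fair six-sided dice, whose true mean is $m = 3.5$): the fraction of trials in which the sample average lands within $\varepsilon = 0.1$ of the true mean climbs toward 1 as the number of rolls per trial grows.&lt;/p>

```python
import random

random.seed(0)  # fix the seed so the experiment is reproducible

TRUE_MEAN = 3.5  # mean of a fair six-sided die
EPSILON = 0.1

def fraction_within_epsilon(n, trials=500):
    # Estimate P(|X_bar_n - m| < epsilon) by repeating the experiment:
    # roll the die n times, average, and check closeness to the true mean.
    hits = 0
    for _ in range(trials):
        sample_mean = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(sample_mean - TRUE_MEAN) < EPSILON:
            hits += 1
    return hits / trials

small_n = fraction_within_epsilon(10)
large_n = fraction_within_epsilon(1000)
print(small_n, large_n)  # the second fraction is much closer to 1
```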
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This law is intuitive, deep &lt;em>and&lt;/em> also deeply embedded in the way that we as human beings trying to navigate the world deal with uncertainty, so I thought it deserved a post.&lt;/p>
&lt;p>Happy New Year to all!&lt;/p></description></item><item><title>Bounding probabilities with Markov and Chebyshev</title><link>https://www.jgindi.me/posts/2018-11-04-markov-chebyshev/</link><pubDate>Sun, 04 Nov 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-11-04-markov-chebyshev/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Very often, finding exact answers is a pain; ballpark estimates usually suffice. When you’re nervous to board
a plane, you don’t care to calculate the exact probability that the plane will land safely; you only care
that it’s over $99.99\%$. When you’re trying to figure out how long a project will take your team at work,
you use approximations throughout your calculation because there are far too many unknowns and variables to
compute the exact answer. Will a member of your team get sick and have to take time off? How quickly will
your team members learn a new technology? How productive will the new hire be? Across computer science, there
are myriad problems whose solutions are, in a rigorous sense, intractable to calculate exactly. To solve
such problems — they show up all over a variety of industries — we design approximation algorithms that
sacrifice some optimality and/or deterministic correctness for gains in efficiency and simplicity.&lt;/p>
&lt;h2 id="ballparking">Ballparking&lt;/h2>
&lt;p>In probability, one particular set of heuristic techniques we use is a set of inequalities called probability
bounds. You would use them for some of the same reasons described above: intractability of an exact
probability calculation; lack of need for an exact answer; generality (they don’t make many complex, esoteric
or restrictive assumptions, so they apply to lots of different problems). In this post, I want to state and
prove a few probability bounds and show how you can apply them. While the estimates might not always give
you something useful to work with, it’s good to be aware of how to use the bounds should the opportunity
present itself.&lt;/p>
&lt;h3 id="example">Example&lt;/h3>
&lt;p>As we often do when probability is concerned, let’s think about a sequence of coin flips. Instead of the usual unbiased coin, though, let’s say the coin falls heads with probability $p=1/3$ (and tails with probability $1 - p = 2/3$). In a sequence of 100 flips, what is the probability that at least half of them fall heads?&lt;/p>
&lt;p>The natural distribution with which to model this question is the binomial. If we let the random variable $X =$ # of heads in 100 flips, we would say that $X \sim \text{Bin}(100, 1/3)$ ($X$ represents the number of heads in 100 flips when the probability of heads on each independent toss is 1/3). In the case of this particular problem, we can calculate the exact probability of 50 or more heads. The formula that computes the exact solution looks like this (don’t worry about what the symbols mean; if you’re interested, see the &lt;a href="https://en.wikipedia.org/wiki/Binomial_distribution">Wikipedia article on the binomial distribution&lt;/a>):&lt;/p>
$$\sum_{k=50}^{100} {100 \choose k}(\frac{1}{3})^k(\frac{2}{3})^{100-k} \approx 0.00042.$$&lt;p>Without a bunch of cleverness or a computer, it would be very hard, if not impossible, to carry out that computation by hand. Let’s see if those probability bounds I talked about can lend us a hand. Before we do, we need to introduce two facts about binomial random variables. If $X \sim \text{Bin}(n, p)$, the expected value (average value) of $X$, written $E(X)$, is equal to $np$ (the number of trials times the individual probability of success), and the variance of $X$ (a measure of how much $X$ deviates from its average value), written $\text{Var}(X)$, is equal to $np(1-p)$. With that in mind, follow me.&lt;/p>
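&lt;p>That sum is hopeless by hand but trivial for a computer. A quick sketch (the function name is mine) that evaluates the tail directly from the binomial PMF:&lt;/p>

```python
from math import comb

def binom_tail(n, p, k):
    # P(X at least k) for X ~ Bin(n, p), summing the PMF term by term
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

prob = binom_tail(100, 1/3, 50)  # the roughly 0.00042 quoted above
```

&lt;p>Each term is exactly one summand of the formula above, so this is the same computation the display shows, just delegated to the machine.&lt;/p>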
&lt;h2 id="markovs-inequality">Markov&amp;rsquo;s inequality&lt;/h2>
&lt;p>The first probability bound we’re going to look at is known as &lt;em>Markov’s inequality&lt;/em>. Before we technically state it, in plain English, Markov’s inequality bounds the probability that a random variable exceeds some value. Formally, if $X$ is a nonnegative random variable (the outcomes of your experiments are never negative), then for any real number $a > 0$,&lt;/p>
$$P(X \geq a) \leq \frac{E(X)}{a}.$$&lt;p>Before we prove this, I want to explicitly state that the primary advantage of Markov’s inequality is its generality. It usually doesn’t furnish the most useful bounds, but notice that in order to apply Markov, all we need to know is that $X$ is a nonnegative random variable. As we will see below, it is also an important building block from which we will derive better probability bounds. Before we see what it would tell us about the problem we started with, let’s prove it.&lt;/p>
&lt;h3 id="proof">Proof&lt;/h3>
&lt;p>Let the random variable $I = 1$ if $X \geq a$ and $0$ otherwise. The proof is mostly complete when you notice this convenient, but somewhat subtle fact:&lt;/p>
$$I \leq \frac{X}{a}.$$&lt;p>Why? Let’s think about it. If $I = 1$, then, by definition, $X \geq a$. Dividing by $a$ on both sides of $X \geq a$ gives us $\frac{X}{a} \geq 1 = I$. If $I = 0$, the above inequality holds because $X$ is nonnegative and $a$
is positive. Taking the expectation of both sides (don’t worry — expectation doesn’t change the sign of the
inequality), the above inequality turns into $E(I) \leq E(\frac{X}{a})$. We can pull $\frac{1}{a}$ out of the
right hand term because you can pull constants out of expectation. Further, by the definition of expected
value, $E(I) = 1 \times P(I = 1) + 0 \times P(I = 0) = P(I = 1)$ is just the probability that $X \geq a$ (by
our definition of $I$), so the above expression can be rewritten as&lt;/p>
$$P(X \geq a) \leq \frac{E(X)}{a},$$&lt;p>so we’ve finished the proof.&lt;/p>
&lt;h3 id="a-bound-on-the-example-using-markov">A bound on the example using Markov&lt;/h3>
&lt;p>Now that we know that Markov’s inequality holds, let’s see what it can tell us about our problem. Recall that
we defined $X$ to be the number of flips that fall heads in a sequence of $100$ tosses. In particular, we
wanted to know what the odds were that there were more than or equal to $50$ heads, i.e. the probability that
$X \geq 50$. Noting that the expected value of our particular $X$ (with $n = 100$ and $p = 1/3$) is $np = 100
\times 1/3$, we can plug these numbers into Markov and see that&lt;/p>
$$P(X \geq 50) \leq \frac{100 \times \frac{1}{3}}{50} \approx 0.67.$$&lt;p>This tells us that the probability of seeing at least $50$ heads is at most $0.67$. Given the exact answer we
computed above, this estimate doesn’t tell us very much, but notice how simple it was to compute.&lt;/p>
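&lt;p>The Markov arithmetic really is a one-liner; a sketch (variable names are mine):&lt;/p>

```python
n, p, a = 100, 1/3, 50
mean = n * p              # E[X] = np for a binomial random variable
markov_bound = mean / a   # Markov: P(X at least a) is at most E[X] / a
```

&lt;p>All the bound needs is the mean, which is exactly why it is so general and so loose.&lt;/p>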
&lt;h2 id="chebyshevs-inequality">Chebyshev&amp;rsquo;s inequality&lt;/h2>
&lt;p>The next bound I want to look at is called Chebyshev’s inequality. While it looks and feels a bit different
from Markov, its spirit is similar. Chebyshev is a statement about the probability that some random variable
deviates from its average by a certain amount. Its formal statement is: if $X$ is a random variable with
finite expected value $\mu$ and finite variance $\sigma^2$ (those two conditions hold enough of the time),
then&lt;/p>
$$P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.$$&lt;p>In English, the probability that $X$ deviates from its mean by at least $k$ is at most $X$’s variance divided by $k^2$.
We will find this bound a bit more useful when we apply it to our problem, but before we do that, we need to prove it.&lt;/p>
&lt;h3 id="proof-1">Proof&lt;/h3>
&lt;p>If $X$ is a random variable, then $X - \mu$ is a random variable. Furthermore, if $X - \mu$ is a random variable,
then so is $(X - \mu)^2$. Now we apply Markov with $X$ replaced by $(X - \mu)^2$ and $a$ replaced by $k^2$, and we have:&lt;/p>
$$P((X - \mu)^2 \geq k^2) \leq \frac{E((X - \mu)^2)}{k^2}.$$&lt;p>The expression in the parentheses on the left side is equivalent to the expression $|X - \mu| \geq k$ (take the square
root of both sides). Replacing what’s in parentheses with the equivalent formulation and recognizing the numerator on
the right side as the very definition of variance completes the proof because we can rewrite the above as&lt;/p>
$$P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.$$&lt;p>That proof might have actually been mechanically easier than Markov! In addition, Chebyshev’s inequality is usually more
informative and almost as general as Markov, so let’s take a look at what it can tell us about our problem.&lt;/p>
&lt;h3 id="a-bound-on-the-example-using-chebyshev">A bound on the example using Chebyshev&lt;/h3>
&lt;p>Recall that $\mu = np$ in our problem is approximately $33.3$ and that the variance is $\sigma^2 = np(1-p) = 100 \times 1/3 \times 2/3 = 200/9 \approx 22.2$. The information we’re missing here is what value of $k$ to choose. In this case, we
want to know something about the probability that there are more than or equal to $50$ heads. $50 - 33.3 = 16.7$ so if we
set $k$ to $16.7$, we will be able to apply Chebyshev to know an upper bound on the probability that $X$ takes a value
that is more than $16.7$ away from $\mu$ in either direction. More directly, we would be able to say something about the
probability that $X$ takes a value less than or equal to $33.3 - 16.7 = 16.6$ or greater than or equal to $33.3 + 16.7 =
50$. The lower-tail probability that $X$ is less than or equal to $16.6$ is superfluous for our question, but including it doesn’t affect the correctness of our
upper bound. By Chebyshev, we can conclude that&lt;/p>
$$P(|X - \mu| \geq 16.7) \leq \frac{22.2}{16.7^2} \approx 0.079,$$&lt;p>so our upper bound is now around $8\%$. Given the result of the explicit calculation we computed first, this isn’t especially tight either, but it’s much better than Markov and it was almost as easy to compute!&lt;/p>
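&lt;p>The Chebyshev arithmetic is just as mechanical; a sketch using exact fractions instead of the rounded decimals above (variable names are mine):&lt;/p>

```python
n, p = 100, 1/3
mu = n * p                  # 100/3, about 33.3
sigma2 = n * p * (1 - p)    # 200/9, about 22.2
k = 50 - mu                 # 50/3, about 16.7
cheb_bound = sigma2 / k**2  # 200/2500 = 0.08 in exact arithmetic
```

&lt;p>With unrounded values the bound comes out to exactly $0.08$, consistent with the $\approx 0.079$ obtained above from the rounded $22.2$ and $16.7$.&lt;/p>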
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>There are other, more powerful bounds that I won’t go into here because they’re more involved and this post is already long, but hopefully you’ve gotten an idea about where these sorts of things might be useful.&lt;/p></description></item><item><title>A different way of thinking about eigenvalues</title><link>https://www.jgindi.me/posts/2018-10-11-eigvals/</link><pubDate>Thu, 18 Oct 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-10-11-eigvals/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This post’s title is intentionally vague. Usually, I write an introduction that describes the path the post will walk and then I meander down that path from beginning to end, and write a conclusion that sums it all up. After thinking about how to write this post in the most engaging way, it occurred to me that mathematics has often felt the most satisfying to me when I’ve felt as though I was discovering definitions and theorems rather than being read them by professors. My hope is that in addition to conveying the beauty and ingenuity behind what follows, I am also able to pass along some of the wonder that I myself felt during the journey. With that, follow me&amp;hellip;&lt;/p>
&lt;h2 id="review-of-linear-operators">Review of linear operators&lt;/h2>
&lt;p>Suppose we have a vector space $V$. One of the most important objects (if not &lt;em>the&lt;/em> most important) we study in linear algebra is the class of structure-preserving maps called linear transformations (a.k.a. linear maps). As a quick review, a transformation is (loosely) a function that takes a vector from one vector space and turns it into a vector in another using some rule. In order for a transformation $T$ to be called a &lt;em>linear&lt;/em> transformation, we need the following two properties to hold:&lt;/p>
&lt;ol>
&lt;li>If $u$ and $v$ are vectors, it must be the case that $T(u + v) = T(u) + T(v)$.&lt;/li>
&lt;li>If $v$ is a vector and $c$ is a member of the $V$’s underlying field, it must be the case that $T(cv) = cT(v)$.&lt;/li>
&lt;/ol>
&lt;p>In English, (1) and (2) say that linear transformations must preserve addition and scalar multiplication &amp;ndash; that is, adding/scaling then mapping must produce the same result as mapping then adding/scaling. For this post, we’re going to focus our attention on linear &lt;em>operators&lt;/em>, or linear maps from $V$ to itself.&lt;/p>
&lt;h2 id="invariant-subspaces">Invariant subspaces&lt;/h2>
&lt;p>So suppose $T$ is an operator on $V$. It’s natural to wonder how $T$ behaves with respect to subspaces of $V$. In the case of an operator, it will always be the case that $T$ maps a subspace $U$ to some subspace of $V$, but &lt;em>how&lt;/em> $T$ transforms an arbitrary choice of subspace is unclear. Let’s simplify a little bit and think about the neatest-possible set of outputs that $T$ might produce as it acts on vectors from $U$; we are going to focus on subspaces which, with respect to $T$, are in some sense self-contained. In other words, let’s require that $T$ map every vector in $U$ back into $U$. (A more compact way of phrasing our requirement is that we want $T$ to be an operator on $U$.) If $T$ behaves this way with respect to $U$, we say that $U$ is &lt;em>invariant&lt;/em> under $T$. (In technical jargon, we say a subspace $U$ is invariant under $T$ if for every vector $u \in U$, $T(u) \in U$.)&lt;/p>
&lt;p>I don’t know about you, but I still don’t feel on firm enough footing; let’s simplify further. Instead of letting our invariant subspaces get big and complicated, let’s restrict our focus to $T$’s invariant subspaces of the lowest possible dimension. As a brief aside, the dimension of a vector space is the smallest number of vectors whose linear combinations make up the entire space. For example, consider $\mathbf{R}^2$ (the $(x,y)$ plane). It is a vector space of dimension 2. Why? Because I can combine the two vectors $(0,1)$ and $(1,0)$ into any vector (coordinate pair) that I want! How? Notice that $(a,b) = a \times (1,0) + b \times (0,1)$. In the $\mathbf{R}^2$ example, we call $\{(0,1), (1,0)\}$ a basis (“a” basis because there are infinitely many others). A vector space&amp;rsquo;s dimension is defined as the size of any basis (all bases are the same size).&lt;/p>
&lt;h2 id="eigenvalues">Eigenvalues&lt;/h2>
&lt;p>In light of our detour, what might a one-dimensional invariant subspace look like? Well, a one dimensional subspace has a basis of size one, which means that the subspace is made up of linear combinations of a single vector, i.e. its scalar multiples! In math terms &amp;ndash; and in this case, I actually think the symbols help &amp;ndash; a one dimensional subspace $U$ looks like
&lt;/p>
$$U = \{ au ~|~ a \in F \},$$&lt;p>
where $F$ is $V$’s underlying field.&lt;/p>
&lt;p>Now let’s say this low-dimensional subspace is invariant under $T$. This would mean that $T(u)$ lands back in $U$, and
given that $U$ is of dimension one, $T$ must send $u$ to a scalar multiple of itself. In other words, there is $\lambda \in
F$ such that $T(u) = \lambda u$.&lt;/p>
&lt;p>As you might have guessed by now, $\lambda$ is what we call an eigenvalue and $u$ is its corresponding eigenvector.&lt;/p>
&lt;p>A central focus of linear algebra is understanding the relationship between linear transformations
(useful abstractions) and their matrices (computational tools). Eigenvalues and eigenvectors (loosely) allow you to write
down computationally friendlier matrices corresponding to linear transformations. (The friendliest of these is known as a
diagonal matrix &amp;ndash; the only nonzero entries are those along the diagonal stretching from the top left to bottom right
corners. You can write down a diagonal matrix if and only if you manage to find a basis of eigenvectors.)
Eigenvalues and eigenvectors are typically not motivated particularly well. For a while, I trusted my professors that they
were important and useful, but I’d really never seen a way in which they arise naturally.&lt;/p>
&lt;p>Typically, eigenvalues are presented as the roots of the characteristic polynomial of an operator. What’s the characteristic
polynomial? If $A$ is the matrix corresponding to $T$ (huh?), then the characteristic polynomial is given by $p(\lambda) =
\det(A - \lambda I)$. The roots of $p$ are $T$’s eigenvalues. Wait, but what is that $\det$ symbol? Determinants, you say?
What are those? How do I know that $\det(A - \lambda I)$ is a polynomial? To know that, you’d need to know how to unpack all
of those symbols, which requires understanding what they are… and before you know it, you’re so far down so many
rabbit holes that you stop thinking and just start accepting. A few days later, your professor moves on to matrix
diagonalization and in a haze of all of the other things going on in your life, you’ve memorized a totally opaque technique
that you’ve applied correctly &lt;em>just&lt;/em> enough times to feel like you understand. This, I believe, is one of the sneakiest and
most pervasive ways that beautifully intuitive math manages to pass students by.&lt;/p>
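&lt;p>For a concrete taste of the invariance story (the matrix here is an arbitrary example of mine, nothing special): for a $2 \times 2$ matrix the characteristic polynomial is just $\lambda^2 - \text{tr}(A)\lambda + \det(A)$, so we can find the eigenvalues with the quadratic formula and check directly that applying $A$ to an eigenvector only rescales it.&lt;/p>

```python
from math import sqrt

# an illustrative symmetric 2x2 matrix; its characteristic polynomial
# is lambda^2 - trace*lambda + det
A = [[2.0, 1.0], [1.0, 2.0]]
trace = A[0][0] + A[1][1]                    # 4.0
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]  # 3.0
disc = sqrt(trace**2 - 4 * det)
eigs = ((trace + disc) / 2, (trace - disc) / 2)  # (3.0, 1.0)

# the span of u = (1, 1) is a one-dimensional invariant subspace:
# applying A to u just rescales it by the eigenvalue 3
u = (1.0, 1.0)
Au = (A[0][0] * u[0] + A[0][1] * u[1], A[1][0] * u[0] + A[1][1] * u[1])
```

&lt;p>Here $Au = (3, 3) = 3u$: the line through $(1,1)$ is mapped into itself, which is exactly the one-dimensional invariance we used to define eigenvalues above.&lt;/p>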
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Don’t fall victim! When you run up against a concept you don’t understand, keep struggling with it; don’t get lulled into
indifference because you can compute a correct answer. Ask questions; no question is too small. From experience, I can tell
you that there is an elegance waiting for you beyond the struggle. One so deep, fundamental and profound that you’ll be
truly glad you stuck around.&lt;/p></description></item><item><title>Counting chord intersections: two approaches</title><link>https://www.jgindi.me/posts/2018-09-16-two-approaches/</link><pubDate>Sun, 16 Sep 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-09-16-two-approaches/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In non-mathematical disciplines, it is very often the case that approaching the same question in different ways will lead you to different conclusions. One of the beautiful qualities of math and the type of reasoning it requires is that for any given problem, there might be (and usually is) a plethora of different approaches. The discovery of a new approach adds perspective about the problem and allows the solver’s understanding of a particular area to deepen and expand.
In helping my cousin with some homework recently (she figured out one of the solutions below before I did), I came across a wonderful example of the way different approaches inform each other. One is far more intuitive than the other, but I didn’t realize there was a more intuitive solution until after solving it in a more complex, algebraic way.
Without further ado, let’s begin.&lt;/p>
&lt;h2 id="the-problem">The problem&lt;/h2>
&lt;p>Let’s say you’ve got a class of 5 students, and you have them arrange themselves as follows. You tell them to stand in a circle and you give each pair of them a string to hold tight between them. If people are dots and the strings between them are the line segments, the arrangement might look like this:&lt;/p>
&lt;img src="https://www.jgindi.me/posts/two-approaches/k5.png" alt="drawing" width="300" height="300"/>
&lt;p>The question we want to answer is: &lt;strong>How many intersections are there in the middle of the circle?&lt;/strong>
For the case when there are 5 students, the answer, by inspection of the picture above, is 5 (the vertices of the pentagon in the middle). For the remainder of this post, we will concern ourselves with the more general version of this question, which is: If there are $n$ people, can we find a formula &amp;ndash; in terms of $n$ &amp;ndash; that tells us the number of intersections in such an arrangement?&lt;/p>
&lt;p>The remainder of this post will meander through two approaches; the first algebraic, and the second intuitive.&lt;/p>
&lt;h2 id="algebraic-approach">Algebraic approach&lt;/h2>
&lt;p>The first thing to do to make the problem more approachable is to simplify it. In this case, we will narrow our focus to the number of intersections that the strings emanating from &lt;em>one&lt;/em> person are a participant in. If we can do that, we can multiply by the number of people there are* and divide by the number of times we overcount each intersection, and voila, problem solved.
At this point we observe two simple facts:&lt;/p>
&lt;ol>
&lt;li>There are a total of $n$ people.&lt;/li>
&lt;li>Each intersection is going to be counted 4 times. Why? Because each intersection requires two strings, each of which has two people holding the ends. Thus, once I’ve come up with my formula for the number of intersections per person, I have to remember that simply multiplying the result by $n$ is going to count each intersection 4 times, so we need to divide the total number of intersections we find by 4.&lt;/li>
&lt;/ol>
&lt;p>From (1) and (2), we can deduce that our answer is going to have the form:
&lt;/p>
$$\frac{n}{4} \cdot \text{intersections per person}$$&lt;p>The last thing we need to do is figure out the way to express the number of intersections per person. We will derive this in steps:&lt;/p>
&lt;ol>
&lt;li>Each string that crosses the middle of the circle splits the remaining people into two groups. For simplicity, let’s call them the right-hand and left-hand groups.&lt;/li>
&lt;li>The number of strings that intersect the splitting string is exactly the number of strings that pass from the right-hand group to the left-hand group.&lt;/li>
&lt;li>Let’s say that there are $k$ people in the right-hand group. Because there are 2 people holding the splitting string, the left group must have the remaining people, i.e. $n - k - 2$. Each of the $k$ people in the right group shares a string with each of the people in the left group, so the number of strings that cross the splitting string is given by $\text{(\# of people in right group)} \cdot \text{(\# of people in left group)} = k (n - k - 2)$.&lt;/li>
&lt;li>We now just have to treat each string that a person is holding as a splitting string, use the formula from (3) and add the results up.&lt;/li>
&lt;/ol>
&lt;p>If you understand steps (1)-(4), the rest is just boring, but unfortunately necessary algebra. If you’re new to discrete math, you should try to work this out yourself; there are some summations in there that you’ll encounter over and over again and it’s a good way to get familiar with some of them. If algebra isn’t your thing, trust me that this approach works and skip to the intuitive approach below.&lt;/p>
&lt;p>The goal now is to simplify
&lt;/p>
$$\sum_{k=1}^{n-3} k(n-k-2).$$&lt;p>
(We will multiply by $n/4$ at the end.) Our first step will be to distribute the $k$ to get:
&lt;/p>
$$\sum_{k=1}^{n-3} nk - k^2 - 2k.$$&lt;p>
From now on, I’m going to write the summation without the indices (because typing them is tedious), but they’re implied. Splitting up into three summations, we have:
&lt;/p>
$$\sum nk - \sum k^2 - \sum 2k.$$&lt;p>
Pulling out constants, this is equal to
&lt;/p>
$$n\sum k - \sum k^2 - 2 \sum k = (n-2)\sum k - \sum k^2.$$&lt;p>
We know (but if you don’t, try to prove it yourself), that $\sum_{i=1}^m i = m(m+1)/2$. Once you’ve proven that, you can use a similar method to prove that $\sum_{i=1}^m i^2 = m(m+1)(2m+1)/6$. Substituting these in and recalling that $m$ in our case is $n - 3$, we have
&lt;/p>
$$\frac{(n-2)(n-3)(n-3+1)}{2} - \frac{(n-3)(n-3+1)(2n-6+1)}{6}$$&lt;p>
Simplifying that monster,
&lt;/p>
$$
\begin{align*}
\frac{3(n-2)(n-3)(n-2) - (n-3)(n-2)(2n-5)}{6}
&amp;= \frac{(n-3)(n-2)(3n - 6 - (2n - 5))}{6} \\\\
&amp;= \frac{(n-3)(n-2)(n-1)}{6}.
\end{align*}
$$&lt;p>
Multiplying that by $n/4$, we end up with the surprisingly neat formula for the total number of intersections:
&lt;/p>
$$\frac{n(n-1)(n-2)(n-3)}{4!}.$$&lt;p>
For those who have seen this formula before, this is exactly the number of ways to choose a group of 4 people from a population of size $n$. Once I had spent all this time doing all that algebra, it occurred to me that there had to be an easier way to articulate the solution&amp;hellip;&lt;/p>
&lt;h2 id="intuitive-approach">Intuitive approach&lt;/h2>
&lt;p>The simple way to solve the problem is to use one observation (that we actually made earlier): each intersection corresponds exactly with a single group of four people. If you draw the picture for $n = 4$, you’ll clearly see the single intersection. This observation finishes the problem for us… Can you see how?
If we know that each intersection corresponds with exactly one group of four people, that means the total number of intersections must be exactly equal to the number of groups of four people you can choose from a group of size $n$, i.e. $\frac{n(n-1)(n-2)(n-3)}{4!}$.&lt;/p>
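&lt;p>If you want to convince yourself numerically, the count can be brute-forced without any geometry: two strings cross inside the circle exactly when their four endpoints interleave around the boundary. A sketch (all names are mine):&lt;/p>

```python
from itertools import combinations
from math import comb

def count_intersections(n):
    # Label the n people 0..n-1 around the circle. A string is a sorted
    # pair (a, b). Two strings with four distinct endpoints cross inside
    # the circle exactly when their endpoints interleave on the boundary.
    strings = list(combinations(range(n), 2))
    crossings = 0
    for (a, b), (c, d) in combinations(strings, 2):
        if len({a, b, c, d}) == 4 and (d > b > c > a or b > d > a > c):
            crossings += 1
    return crossings

# matches n(n-1)(n-2)(n-3)/4!, i.e. the number of ways to choose 4 people
checks = [count_intersections(n) == comb(n, 4) for n in range(4, 9)]
```

&lt;p>Strings sharing an endpoint meet on the boundary rather than in the interior, which is why the four-distinct-endpoints condition appears; each group of four people contributes exactly one crossing, matching the intuitive argument.&lt;/p>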
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Neither of the above approaches is better than the other, per se. Each solution requires its own set of deductions and observations, the collection of which leave you with a richer understanding of how you might solve similar problems in the future. In this case, it didn’t even occur to me that there might be a simpler solution until I had already worked hard to find a more involved approach. Enjoy the process! The things you end up understanding the most deeply are the things you can think of and explain from a bunch of different angles.
Thanks for reading!&lt;/p>
&lt;p>*This is a nice example of a phenomenon in mathematics called symmetry. We say that an object is symmetric under some transformation $T$ if the object doesn’t change when you apply $T$ to it. In our case, notice that in the picture at the beginning of the post ($n = 5$), if you calculate the solution for any particular person $p_1$, you have solved it for all of the others &amp;ndash; just rotate the next person you want to solve it for into $p_1$’s position and notice that the problem you’re solving for the new person is of exactly the same form as the one you solved for the first person!&lt;/p></description></item><item><title>Tale of two distributions</title><link>https://www.jgindi.me/posts/2018-08-02-bin-pois/</link><pubDate>Thu, 02 Aug 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-08-02-bin-pois/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>A theme of the many posts I’ve written over the last few years is that there are deep and beautiful connections we find between apparently different areas by appealing to a little bit of formalism and some finely-tuned intuition. In this case, the two objects I will connect do not look too different from one another. In fact, they look eerily similar; it just isn’t immediately clear how to connect the dots.&lt;/p>
&lt;p>(For this post, I’m going to assume a basic understanding of random variables and probability distributions (more or less just what they are). I’ll also assume basic familiarity with permutations and combinations — just the definitions — and some basic facility with limits. It’s more than I like to assume and I’ll do my best to explain things intuitively as I go along, but if you stick with me, I think it will be well worth your while. This stuff is very, very cool.)&lt;/p>
&lt;h2 id="binomial-distribution">Binomial distribution&lt;/h2>
&lt;p>Suppose I’m flipping a fair coin. If $X$ is a random variable that takes the value 1 on heads and 0 on tails, we might encode the likelihoods of different outcomes of this experiment as $P(X = 1) = P(X = 0) = 1/2$. More generally, if I have some experiment with two possible outcomes, one of which occurs with probability $p$ (we call this event a success) and the other of which occurs with probability $1 - p$ (we call this a failure), a random variable representing the outcome of the experiment is said to be a “Bernoulli $p$” (henceforth $\text{Bern}(p)$) random variable. The random variable $X$ from the example I opened with is $\text{Bern}(1/2)$, but the general story that describes what sorts of sample spaces might hint at this underlying distribution is the same no matter what $p$ is: I have some experiment with two possible outcomes, one that occurs with probability $p$ and another that occurs with probability $1 - p$.&lt;/p>
&lt;p>Let’s make things more interesting. Suppose I repeat this Bernoulli (read: two-outcome) experiment $n$ (read: a bunch of) times, one after the next, where each trial is independent of all the others and ask you: What is the probability that $k$ (read: some number smaller than $n$) of experiments result in a success?&lt;/p>
&lt;p>(Think about this and try to work out the answer before reading on.)&lt;/p>
&lt;p>As &lt;a href="https://www.quora.com/profile/Alon-Amit?q=alon%20amit">Alon&lt;/a> recently put it, your internal monologue should go something like: “Well… Because my trials are independent of one another, I’m probably dealing with a bunch of probabilities multiplied together. The probability of one success is $p$, so the probability of $k$ successes has got to be $p^k$. The remaining $n - k$ events must be failures, so by similar logic, the probability that the rest of the experiments fail is $(1-p)^{n-k}$. Multiplying these together, I’m pretty sure the answer is $p^k(1-p)^{n-k}$.”&lt;/p>
&lt;p>Almost! There’s just one piece you’d be missing. As an example, let’s think about the different ways to get two heads in three coin flips. You can either get HHT, HTH or THH. Each of these has probability $(1/2)^2(1/2)$, &lt;em>but there are three ways&lt;/em> to achieve an outcome with that probability, so the probability of getting two heads actually ends up being $3(1/2)^2(1/2)$.&lt;/p>
&lt;p>More generally, the probability of any one ordering of experiment outcomes with $k$ successes is indeed $p^k(1-p)^{n-k}$, but to be complete, you need to count up all of the different ways to arrive at $k$ successes in a sequence of $n$ trials. The number of configurations in which $k$ of the $n$ trials are successes is ${n \choose k}$, so the probability of seeing $k$ successes in $n$ independent Bernoulli trials is ${n \choose k}p^k(1-p)^{n-k}$.&lt;/p>
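&lt;p>The three-flip example can be checked by brute enumeration; a tiny sketch comparing the count of favorable sequences against the formula ${3 \choose 2}(1/2)^2(1/2)$:&lt;/p>

```python
from itertools import product
from math import comb

# all 2^3 equally likely sequences of three fair-coin flips
seqs = list(product("HT", repeat=3))
two_heads = [s for s in seqs if s.count("H") == 2]  # HHT, HTH, THH
prob = len(two_heads) / len(seqs)                   # 3/8 = 0.375

formula = comb(3, 2) * (1/2)**2 * (1/2)             # the binomial PMF value
```

&lt;p>Both routes give $3/8$: the ${3 \choose 2}$ factor is precisely the count of orderings the enumeration finds by hand.&lt;/p>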
&lt;p>The above story, wherein I’m trying to compute the probability of seeing a certain number of successes across a bunch of Bernoulli trials, describes what is called the binomial distribution. (An interesting note here: if you multiply out $(a + b)^n$, you’ll get $a^n + na^{n-1}b + {n \choose 2}a^{n-2}b^2 + \dots + b^n$. See if you can phrase what’s happening there in the language of permutations and combinations.) The binomial distribution is one of the most famous discrete distributions there is and it has a wide range of applications all over probability. Before explaining why I’ve led you here, we need to take a quick detour.&lt;/p>
&lt;h2 id="poisson-distribution">Poisson distribution&lt;/h2>
&lt;p>While the binomial distribution is just doing its thing, minding its own business, in a galaxy far, far away sits another distribution: the Poisson. The Poisson distribution is usually used to describe settings in which many trials occur and each trial has a very small probability of success. As an example, a Poisson random variable might represent the number of emails you received in the past hour. The likelihood that any one person emailed you during that hour is very low (low probability of success), but the number of people (trials) who could possibly email you is very high.&lt;/p>
&lt;p>Deriving the PMF (probability mass function — essentially a formula that allows you to calculate the probabilities of various events) for this is not as easy as it was to do with the binomial, but before we jump in and try to figure out what that might look like, it’s useful to note that our Poisson story bears some interesting similarity to the story we used to define the binomial. In both cases, there are a bunch of trials taking place, each of which has some probability of success and we want to know what the probability of seeing a certain number of successes is. To crystallize this fuzzy-seeming connection, and this is the critical point, &lt;em>what we are trying to capture with the Poisson story is essentially the same thing we captured when we derived the binomial, but specifically when $n$ is large and $p$ is small.&lt;/em> Can we formalize this somehow?&lt;/p>
&lt;h2 id="the-connection">The connection?&lt;/h2>
&lt;p>We can! What we’re going to do first is to define the relationship between $n$ and $p$. Above, we revealed the need to express that $p$ is small and $n$ is large. One way to do that is to enforce that $np$ be constant (there are other ways, but this is the one that will be useful here); we’ll name that constant $\lambda$. Why does this help us? Well, if I pick a value of $\lambda$, say 1, before I start, then once I choose $n$, I’ve determined $p$. As an example, if $n = 100$, then because $np = 1$, $p$ must be 1/100. Furthermore, as $n$ gets larger we see that $p$ has to get smaller to keep $\lambda$ constant, so if we let $n$ tend to $\infty$, we are implicitly forcing $p$ to tend to 0. Being able to write $p$ in terms of $n$ is going to be very helpful in what’s to come.&lt;/p>
&lt;p>With the relationship between $n$ and $p$ specified more precisely, we will now see what happens to the binomial PMF as $n$ grows. That is, we want to compute
&lt;/p>
$$\lim_{n \to \infty} {n \choose k} p^k(1 - p)^{n - k}.$$&lt;p>
Because $p = \lambda / n$ we can rewrite all the $p$s in the above limit as $\frac{\lambda}{n}$s:
&lt;/p>
$$\lim_{n \to \infty} {n \choose k} (\frac{\lambda}{n})^k(1 - \frac{\lambda}{n})^{n - k}.$$&lt;p>
Simplifying a little bit, the limit can be rewritten as
&lt;/p>
$$\lim_{n \to \infty} \frac{n!}{k!(n-k)!} \frac{\lambda^k}{n^k}(1 - \frac{\lambda}{n})^{n - k} = \lim_{n \to \infty} \frac{n(n-1)(n-2)\dots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot (1 - \frac{\lambda}{n})^{n - k}.$$&lt;p>
We will look at each term of the product one at a time. Provided that their limits all exist, we can glue them together.
For the leftmost multiplicand, noting that $\frac{n-m}{n}$ for $m$ constant tends to 1 as $n$ tends to $\infty$, we observe that we have $k$ numerator-denominator pairs that all tend to 1. Multiplied together, this shows us that the leftmost term tends to 1.
We can pull $\frac{\lambda^k}{k!}$ out of the limit because $k$ is constant. Combined with our analysis of the first term, we still need to solve
&lt;/p>
$$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^{n - k}.$$&lt;p>
Because $k$ is constant and $n$ is shooting off to $\infty$, $n-k$ behaves the same as $n$ in our case, so we can rewrite the limit as
&lt;/p>
$$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^n.$$&lt;p>
This limit is a thinly veiled version of a common exercise from standard first semester calculus courses. I challenge you to convince yourself that
&lt;/p>
$$\lim_{n\to \infty} (1 + \frac{a}{n})^{bn} = e^{ab}.$$&lt;p>
(Hint: Start by setting $y = \lim_{n\to \infty} (1 + \frac{a}{n})^{bn}$ and taking the natural log of both sides.)&lt;/p>
&lt;p>In our case, $a = -\lambda$ and $b = 1$, so our last piece evaluates to $e^{-\lambda}$. Gluing our pieces together, we have the PMF of the Poisson distribution: if $X$ is a Poisson random variable, $P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$. (Check Wikipedia, I dare you.)&lt;/p>
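&lt;p>If you’d rather watch this convergence happen than take the limit derivation’s word for it, here is a small sketch (Python, standard library only; the helper names are my own) that compares the two PMFs while holding $np = \lambda$ fixed:&lt;/p>

```python
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    # P(X = k) for X ~ Binomial(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poisson(lam)
    return exp(-lam) * lam**k / factorial(k)

# fix lambda, let n grow, and force p = lambda / n
lam, k = 1.0, 2
for n in (10, 100, 10_000):
    p = lam / n
    print(n, round(binomial_pmf(n, p, k), 6), round(poisson_pmf(lam, k), 6))
```

&lt;p>As $n$ grows, the binomial value creeps toward the Poisson one, exactly as the limit promises.&lt;/p>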
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>When I saw the above, I thought it was a really nice use of mathematical formalism to ground a connection we first noticed more intuitively. I find that it is often in finding footing for these sorts of beautiful connections that math shines brightest.&lt;/p>
&lt;p>(This post was inspired by &lt;a href="http://www.people.fas.harvard.edu/~blitz/Site/Home.html">Joe Blitzstein&lt;/a>’s &lt;a href="https://www.youtube.com/watch?v=KbB0FjPg0mw">Stat110&lt;/a> course on YouTube. Your course is awesome and I’ve learned a ton. Thank you.)&lt;/p></description></item><item><title>The birthday problem</title><link>https://www.jgindi.me/posts/2018-05-28-birthday/</link><pubDate>Mon, 28 May 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-05-28-birthday/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>After writing a &lt;a href="https://www.jgindi.me/posts/monty-hall/">post&lt;/a> about the Monty Hall problem the other day, a friend of
mine asked if I’d write one about another famous, counterintuitive probability problem known as the birthday problem.
The problem asks a simple(-seeming) question: determine the smallest number of people who must be in a room in order for
there to be a 50% chance that two of them share a birthday. (Assume every year has 365 days.)&lt;/p>
&lt;p>People typically hear the words “fifty percent” and “birthday” and think something along the lines of: “If there are 365
possible birthdays, then I probably would need about 365/2 people. That’s… **stares up and to the right for a second**… 183
birthdays!” These very people are usually shocked to hear that the solution is far smaller than that, and the rest of this
post will show how to calculate it using some probability.&lt;/p>
&lt;h2 id="using-probability-theory">Using probability theory&lt;/h2>
&lt;p>While the following observation is obvious, it unlocks a new way of solving a whole trove of probability problems: when we
consider an event $A$, it either happens or it doesn’t. (We refer to the “it doesn’t” event as $A^C$, read “$A$
complement”.) Thus if the event $A$ happens with probability $p$, then $A$ &lt;em>does not&lt;/em> happen with probability $1 - p$. This
allows us to write $P(A)$ in terms of $P(A^C)$ and vice versa. For our purposes, we encode this logic as $P(A) = 1 -
P(A^C)$.&lt;/p>
&lt;p>To solve the problem at hand, we need to come up with an expression for the probability that two people share a birthday in
a room of $k$ people. If we call the aforementioned event $A$, we want an expression for $P(A)$. As per the above, if we
can come up with an expression for $P(A^C)$, then we’ve effectively determined our expression for $P(A)$.
In our case, the event $A^C$ is the event that in a room of $k$ people, no two share a birthday. The probability of that
event is given by:
&lt;/p>
$$P(A^C) = \frac{\text{number of ways to assign unique birthdays to } k \text{ people}}{\text{number of ways to assign
birthdays to } k \text{ people}}$$&lt;p>
For the numerator, if we are looking at a room with 10 people in it, we have 365 birthdays to choose from for the
first, 364 for the second, and so on, until we’ve assigned birthdays to all 10 people. The total number of ways is thus
$365 \cdot 364 \cdot \dots \cdot 356$.&lt;/p>
&lt;p>For a room with k people, we can generalize this to $365 \cdot \dots \cdot (365 - k + 1)$, so that expression is our
numerator. To compute the denominator, we are thinking about a more relaxed version of the same assignment problem we
solved to compute the numerator. When I say more relaxed, I mean that we don’t have to worry about assigning the same
birthdays to people anymore. In our room of 10 people, this means we have 365 birthday choices for the first person, but we
also have 365 for the second, third, fourth, fifth and so on. Thus, the total number of ways to assign birthdays to 10
people is $365^{10}$. For a room with $k$ people, this turns into $365^k$.&lt;/p>
&lt;p>Now that we have the numerator and denominator, we can write down $P(A^C)$:
&lt;/p>
$$P(A^C) = \frac{365 \cdot \dots \cdot (365 - k + 1)}{365^k}$$&lt;p>
With this in hand, we can now use the fact that $P(A) = 1 - P(A^C)$ to express
&lt;/p>
$$P(A) = 1 - \frac{365 \cdot \dots \cdot (365 - k + 1)}{365^k}.$$&lt;p>
Now we just start trying values of $k$ until $P(A) \geq 1/2$. This first occurs when $k = 23$.&lt;/p>
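&lt;p>The “just start trying values” step is easy to mechanize. A minimal sketch (Python; the function name is mine) that walks $k$ upward until the probability first crosses 1/2:&lt;/p>

```python
def prob_shared_birthday(k):
    # P(A) = 1 - P(A complement), where the complement is
    # "no two of the k people share a birthday"
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

# smallest k whose probability reaches 1/2
k, prob = 0, 0.0
while 0.5 > prob:
    k += 1
    prob = prob_shared_birthday(k)
print(k, round(prob, 4))  # prints 23 0.5073
```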
&lt;h2 id="intuition">Intuition&lt;/h2>
&lt;p>One explanation for why this number is so intuition-defyingly low is that when people first think about this problem, they
usually focus on the number of people in the room, instead of noticing that the key quantity in this
problem is actually the number of &lt;em>pairs&lt;/em> of people in the room. While 23 people doesn’t seem like much, 23 people
furnishes you with &lt;em>253 pairs&lt;/em>! Each pair has probability 364/365 of &lt;em>not&lt;/em> having the same birthday, so the probability
that none of the 253 pairs shares a birthday is
&lt;/p>
$$(\frac{364}{365})^{253} \approx 0.4995.$$&lt;p>
This means that the probability that some pair shares a birthday is approximately $1 - 0.4995 = 0.5005$.
To see that the number of pairs is what matters, note that if we increase $k$ to 50, the probability that some pair shares
a birthday is roughly 97%. By the time you have 75 at your party, the probability jumps to 99.9%.&lt;/p>
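&lt;p>The pairs heuristic treats the 253 pairs as independent, which they aren’t quite, yet it lands remarkably close to the exact answer. A quick sketch (Python; the function names are mine) comparing the two for a few party sizes:&lt;/p>

```python
from math import comb

def exact_prob(k):
    # exact probability that some pair among k people shares a birthday
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

def pair_approx(k):
    # treat the C(k, 2) pairs as if they were independent (they are not, quite)
    return 1 - (364 / 365) ** comb(k, 2)

for k in (23, 50, 75):
    print(k, round(exact_prob(k), 4), round(pair_approx(k), 4))
```

&lt;p>At $k = 23$ the heuristic reproduces the $\approx 0.5005$ figure above, within a percent of the exact value.&lt;/p>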
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>As was the case with the Monty Hall problem, the birthday problem serves as an example of the way that a rigorous analysis
is a great way to combat our sometimes errant intuition.&lt;/p></description></item><item><title>The Monty Hall paradox</title><link>https://www.jgindi.me/posts/2018-05-24-monty-hall/</link><pubDate>Thu, 24 May 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-05-24-monty-hall/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>The field of probability is rife with counterintuitive results that show how necessary the rigor of mathematics is to
correct understanding of certain situations. This post will be about the Monty Hall Problem. It isn’t hard to state, but
the result is somewhat subtle, so I thought it’d be fun to write about.&lt;/p>
&lt;h2 id="the-problem">The problem&lt;/h2>
&lt;p>The parlor-trick version of the problem goes as follows: You are on a game show and in front of you are three doors
(labeled 1, 2 and 3). Two conceal goats and one hides a car. The car has a 1/3 probability of initially being behind each
door. Your host, Monty, knows which door the car is behind. If you pick the door with the car behind it, you win. Monty
asks you to select a door and you choose door 1. Monty then opens one of the two remaining doors (door 3, say), revealing a
goat, and then asks you if you want to change your selection to door 2. Does it pay to take him up on his offer?
Think about what you would do before continuing to read. What is your intuition telling you?&lt;/p>
&lt;p>At first, many people are ambivalent. They argue that with two doors left, it’s equally likely that the car is
behind door 1 as it is behind door 2, so switching neither hurts nor helps. This argument isn’t quite correct, though:
it doesn’t make use of the information Monty provided you with by opening one of the doors! It turns out that some
knowledge of conditional probability would greatly increase your chances of going home with that new car. Let’s see why.&lt;/p>
&lt;h2 id="law-of-total-probability">Law of total probability&lt;/h2>
&lt;p>Let’s see what a probabilistic argument tells us. From the fact that Monty opened door 3, we know that the car has to be
behind door 1 or door 2. Let $S$ represent the event that the switching strategy wins the game. I think the simplest
argument is the one that makes use of the Law of Total Probability (LoTP), which we will simplify to:
Given a partition of a sample space (a set of events that are disjoint from one another and together make up the whole
sample space) $B, B^C$, $P(A)$ (for $A$ in the same probability space as $B$) can be written as $P(A) = P(A|B)P(B) +
P(A|B^C)P(B^C)$ (Note: $B^C$ is the complement of $B$. $B^C$ occurs if $B$ does not).
(The above can be naturally generalized to a countably infinite partition of the sample space.)
For an intuitive idea of what this means, this picture is helpful:&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/monty-hall/lotp.png" alt="Law of total probability">&lt;/p>
&lt;p>As you can see, part of $A$ intersects with $B$ and the other part intersects with $B^C$. The LoTP basically tells us that
one way of computing $P(A)$ is to add up the probabilities that A occurred given that either (1) $B$ occurred, or (2) $B$
did not occur (i.e. $B^C$ occurred). Because $B$ either occurred or didn’t, this sum has to give the total probability of
$A$. We are going to apply similar logic to our problem.&lt;/p>
&lt;p>As is obvious, at the outset, the car is either behind door 1, door 2 or door 3. Let $D_i$ be the event that the car is
behind door $i$. Because the $D_i$ are a partition of the sample space, I can use the LoTP, so that
&lt;/p>
$$P(S) = P(S|D_1)P(D_1) + P(S|D_2)P(D_2) + P(S|D_3)P(D_3)$$&lt;p>
At the beginning of the game, the car had an equal likelihood of being behind any of the three doors, so we can fill in the
right multiplicands in the sum above:
&lt;/p>
$$P(S) = P(S|D_1)\cdot \frac{1}{3} + P(S|D_2)\cdot \frac{1}{3} + P(S|D_3)\cdot \frac{1}{3}$$&lt;p>
Next, observe that $P(S|D_1)$ is the probability that switching wins given that the car is behind the door you initially
chose. This probability is 0, because you would be switching away from the winning door. If the car is behind door 2 (note
that in this case, door 3 was opened and is no longer in contention), switching always gets you the car, i.e. $P(S|D_2) =
1$. The same is true if the car is behind door 3 (and door 2 was opened). Hence
&lt;/p>
$$P(S) = 0 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} = \frac{2}{3}.$$&lt;p>
In other words, switching wins the game for you with probability $2/3$!&lt;/p>
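&lt;p>If the 2/3 still feels suspicious, a simulation settles it. Here is a minimal sketch (Python; the setup mirrors the story above, with my own function names) that plays the switching strategy many times:&lt;/p>

```python
import random

def play_switch(rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    choice = rng.choice(doors)
    # Monty opens a door that is neither the car nor your pick
    opened = rng.choice([d for d in doors if d != choice and d != car])
    # switch to the one remaining closed door
    final = next(d for d in doors if d != choice and d != opened)
    return final == car

rng = random.Random(0)  # fixed seed for reproducibility
trials = 100_000
rate = sum(play_switch(rng) for _ in range(trials)) / trials
print(rate)  # hovers around 2/3
```

&lt;p>The observed win rate hovers around 2/3, matching the LoTP calculation.&lt;/p>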
&lt;h2 id="intuition">Intuition&lt;/h2>
&lt;p>Let’s see if we can understand this a bit better without appealing to symbols. When you first chose door 1, you had a $1/3$
chance of winning the car. Stated a different way, you had a $2/3$ chance of &lt;em>not&lt;/em> getting the car. When Monty opens one of
the unchosen doors and reveals a goat, he is, in effect, providing you with new information. The probability of your
initial choice being correct is still $1/3$, but with your updated understanding of the world, the $2/3$ probability that
the car is not behind your initial choice of door is all resting on the single remaining door.&lt;/p>
&lt;p>For a more extreme application of this line of reasoning, consider a similar problem in which you start with 100 doors
to choose from. You choose door 1 and Monty opens 98 of the remaining doors, revealing goats. Originally, your choice held
a $99/100$ probability of being incorrect. Initially, that $99/100$ was spread evenly over the 99 doors you didn’t select.
With each door Monty opened, that $99/100$ was condensed to fewer and fewer doors, so that by the very end, there is a
single door holding all of that probability that you were initially wrong. You’d be crazy not to switch.*&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The Monty Hall paradox is a fun, relatable problem that is a terrific example of the need for formality in thinking about
probabilities. Our intuition is a powerful tool, but for as many problems as it allows us to solve, it yet more often leads
us astray. In such moments it is crucial to have a systematic, rigorous approach with which to check ourselves and make
sure that we remain on sound logical footing.&lt;/p>
&lt;p>*For the more symbolically inclined reader, I want to quickly use the logic from the simple cases to show that it in fact
always pays to switch doors. Suppose we generalize the problem so that there are $n\geq 3$ doors and after you select one
of them, Monty opens $m$ of the remaining doors ($1 \leq m \leq n - 2$) and then asks you if you want to switch. In this
case, after opening $m$ doors, the $(n-1)/n$ probability that you were initially wrong gets evenly distributed across the
$n - 1 - m$ remaining doors, so that each door carries a probability of $(n-1)/(n(n - 1 - m))$. We complete the argument by
verifying that this quantity is always bigger than $1/n$.
To do this, note that showing
$1/n &lt; (n-1)/(n(n - 1 - m))$
is the same as showing
$1 &lt; (n-1)/(n - 1 - m)$, which is clear.&lt;/p></description></item><item><title>The mean value theorem</title><link>https://www.jgindi.me/posts/2018-05-22-mvt/</link><pubDate>Tue, 22 May 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-05-22-mvt/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Derivatives are used across many different fields of engineering, physics and mathematics to analyze the ways that
continuous quantities change. Although the definition of derivative that we talk about in calculus courses today works
well, it wasn’t always so simple. Coming up with a way to talk about derivatives where we could both understand it
intuitively as well as give ourselves the right machinery to prove useful results about them took centuries and a bunch of
mathematical legwork that I thought would be worth exposing a small part of.&lt;/p>
&lt;p>In particular, in this post I want to define derivatives and lead up to the proof of the Mean Value Theorem. It and its
generalizations are some of the most important and useful results about derivatives that we have and I thought that proving
the most elementary version would be accessible and somewhat fun!&lt;/p>
&lt;p>(Some small amount of calculus is assumed, but it isn’t strictly necessary.)&lt;/p>
&lt;p>Without further ado, here we go!&lt;/p>
&lt;h2 id="defining-the-derivative">Defining the derivative&lt;/h2>
&lt;p>The first thing we need to do is come up with a rigorous and useful definition of a derivative. To help draw this intuitive
picture, imagine the following. You have two points on the $x$-axis that are pretty close together. We’ll call one point $x$
and the other $c = x + \text{a little bit}$. If our curve is given by $f(x)$, we can denote the slope of the line through
$(x,f(x))$ and $(c, f(c))$ as the change in $y$ values divided by the change in $x$ values (aka rise over run). In other
words, the slope $m$ of the line would be given by
&lt;/p>
$$m = \frac{f(c) - f(x)}{c - x}.$$&lt;p>
This isn’t anything new. You’ve known how to calculate slopes since middle school. The question of slope becomes more difficult, however, when you try to modify the usual notion of the slope of a line through &lt;em>two&lt;/em> points to come up with an analogous description for the slope of the tangent line to a curve at &lt;em>one&lt;/em> point.&lt;/p>
&lt;p>(Note: I will assume for the remainder of this post that you have some familiarity with limits and continuity. If you
don’t, you might want to browse the web and review those briefly. Even if you don’t have the requisite background, I’ll
do my best to explain the concepts in a non-technical, intuitive way.)&lt;/p>
&lt;p>To do this, we have to use some stuff from introductory calculus. Intuitively, what we’re going to do is let “a little bit”
get smaller and smaller, so that $x$ and $c$ get closer and closer together. As the distance between them gets smaller and
smaller, the slope of the secant line through $(x,f(x))$ and $(c,f(c))$ approximates the slope of the tangent line at $x$.
Put succinctly, &lt;strong>the slope of the tangent line of a curve at a point $c$ is the limit of the slopes of secant lines
between $(c,f(c))$ and points $(x, f(x))$ where $x$ gets arbitrarily close to $c$.&lt;/strong> Put mathematically
&lt;/p>
$$f’(c) := \lim_{x \to c} \frac{f(c) - f(x)}{c - x}.$$&lt;p>
We say that the function $f$ is differentiable at $c$ if the limit above exists. This idea of differentiability at a point
can be extended naturally to a set of points.&lt;/p>
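&lt;p>To get a feel for the definition, here is a tiny numeric sketch (Python, with an example curve of my choosing) that shrinks the gap between $x$ and $c$ and watches the secant slopes settle:&lt;/p>

```python
def f(x):
    # an example curve; its derivative at any c is 2c
    return x ** 2

c = 3.0
for gap in (0.1, 0.01, 0.001, 0.000001):
    x = c - gap
    secant = (f(c) - f(x)) / (c - x)
    print(gap, secant)  # the secant slopes approach f'(3) = 6
```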
&lt;p>In a typical real analysis course, after laying definitions down, you get familiar with the definitions by proving some of
the usual facts about derivatives. These are things like $(f + g)’(c) = f’(c) + g’(c), (kf)’(c) = kf’(c)$ ($k$ is a
constant and $f, g$ are differentiable at $c$). I’ll leave proving these and other facts like the power, product,
quotient and chain rules as an exercise for the inclined reader in favor of venturing into more interesting territory.&lt;/p>
&lt;p>In particular, I want to lead up to a proof of the mean value theorem, one of the most fundamental and useful facts about
derivatives. It’s one of those theorems that looks sort of obvious when you just draw some pictures (a la intermediate
value theorem), but it’s actually a pretty deep result. It and its generalizations are some of the most useful tools
mathematicians have had at their disposal to tackle derivatives since their inception in the 17th century. We’ll get to
our main result via a few intermediate ones, starting with the interior extremum theorem (IET).&lt;/p>
&lt;h2 id="the-interior-extremum-theorem">The interior extremum theorem&lt;/h2>
&lt;p>The interior extremum theorem tells us something about the connection between a function&amp;rsquo;s extreme values and points at which the derivative vanishes:&lt;/p>
&lt;blockquote>
&lt;p>Interior Extremum Theorem: Let $f$ be differentiable on the open interval $(a,b)$. If $f$ attains a maximum (or a
minimum) on $(a,b)$ at some point $c$, then $f’(c) = 0$.&lt;/p>&lt;/blockquote>
&lt;p>&lt;strong>Proof:&lt;/strong> Because $c$ is in the open interval $(a,b)$, we can find two sequences, $x_n$ and $y_n$, such that both
converge to $c$ and $x_n &lt; c &lt; y_n$ for all $n$. Given these sequences, we have
&lt;/p>
$$f’(c) = \lim_{n \to \infty} \frac{f(c) - f(x_n)}{c - x_n} \geq 0,$$&lt;p>
because $f(c) - f(x_n)$ is nonnegative (because $f(c)$ is a maximum value of $f$ on $(a,b)$) and $c - x_n$ is positive for
all $n$ (because $x_n &lt; c$). On the other hand, we also see that
&lt;/p>
$$f’(c) = \lim_{n \to \infty} \frac{f(c) - f(y_n)}{c - y_n} \leq 0,$$&lt;p>
because the numerator is again nonnegative, but this time the denominator is negative for all $n$. Thus, we have $0 \leq
f’(c) \leq 0, $ so $ f’(c) = 0$. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;p>Note that the converse is not necessarily true. It isn’t necessarily the case that if the derivative is 0 at a point $c$,
then $f(c)$ is maximum or a minimum of $f$ on $(a,b)$. Consider the function $f(x) = x^3$ on the interval $(-1,1)$. By the
power rule, $f’(x)=3x^2$ (and it is defined on $(-1,1)$). At $x = 0$, $f’(x) = 0$, but $-1 = f(-1) &lt; f(0) &lt; f(1) = 1$,
whence $f$ takes neither a minimum nor a maximum value at 0 even though the derivative vanishes there. Although its
converse doesn’t hold, the IET does furnish us with a pretty powerful computational tool with which to solve optimization
problems, the solutions to which often begin with “First take the derivative of the function you want to optimize and find
the $x$ values at which it equals 0…”.&lt;/p>
&lt;h2 id="rolles-theorem">Rolle&amp;rsquo;s theorem&lt;/h2>
&lt;p>Next, I want to use the IET to prove another result that isn’t so hard to convince yourself of with pictures. It’s called
Rolle’s theorem. It tells us that if $f$ takes the same value at the two ends of an interval and is differentiable on that
interval, then there must be some point within the interval at which the derivative is zero. I actually think it’s
worthwhile to draw some pictures and convince yourself intuitively that this result makes sense before you read the proof
below:&lt;/p>
&lt;blockquote>
&lt;p>Rolle&amp;rsquo;s Theorem: Let $f$ be continuous on $[a,b]$ and differentiable on $(a,b)$ with $f(a) = f(b)$. Then there is a
point $c \in (a,b)$ such that $f’(c) = 0$.&lt;/p>&lt;/blockquote>
&lt;p>&lt;strong>Proof:&lt;/strong> If $f$ is constant on $[a,b]$, then $f’$ is identically zero on $(a,b)$, in which case there’s nothing to
prove. If $f$ is non-constant, then because $f$ is continuous on the closed interval $[a,b]$, the Extreme Value Theorem tells us
that $f$ attains a maximum and a minimum on $[a,b]$. Since $f(a) = f(b)$ and $f$ is non-constant, at least one of those extreme
values must occur at some point $c$ inside $(a,b)$. Applying the IET at $c$, we’re done. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;p>The constant case above is simple. In the non-constant case, what we’re basically saying is that starting at $f(a)$, the
value of $f$ rises (or falls) as the $x$ values move to the right. At a certain point, though, they need to start falling
back down (rising back up) to the value $f$ took at $a$. The point at which this fall (rise) happens is the point we
sought.&lt;/p>
&lt;h2 id="mean-value-theorem">Mean value theorem&lt;/h2>
&lt;p>With these results, we are now equipped to prove the mean value theorem. We will do this by reducing it to a simple
application of Rolle’s theorem.&lt;/p>
&lt;blockquote>
&lt;p>Mean Value Theorem: Let $f$ be continuous on $[a,b]$ and differentiable on $(a,b)$. There is a point $c \in (a,b)$ such
that $f’(c) = \frac{f(a) - f(b)}{a - b}$.&lt;/p>&lt;/blockquote>
&lt;p>(Note that this is a more general version of Rolle’s theorem. Draw some pictures and convince yourself of the theorem
before reading the proof below.)&lt;/p>
&lt;p>&lt;strong>Proof:&lt;/strong> Let’s first write down the equation of the line through $(a, f(a))$ and $(b, f(b))$. On the one hand, we know that the slope of the line is $\frac{f(a) - f(b)}{a - b}$. On the other hand, because we are talking about a line, the slope through any point $(x, y(x))$ on the line and $(a, y(a)) = (a, f(a))$ must be the same, so we have
$\frac{f(a) - f(b)}{a - b} = \frac{y(x) - f(a)}{x - a}.$
We can get a general equation of the line by solving for $y(x)$ (multiply both sides by $x - a$ and then add $f(a)$ to both sides). Doing so, we see that
&lt;/p>
$$y(x) = \frac{f(a) - f(b)}{a - b}(x - a) + f(a).$$&lt;p>
In order to transform this into an instance of Rolle’s theorem, what we’re going to do is build a new function $d$ that represents the differences between the curve and the line. Namely, we have
&lt;/p>
$$d(x) = f(x) - y(x) = f(x) - \biggl[ \frac{f(a) - f(b)}{a - b}(x - a) + f(a) \biggr].$$&lt;p>
Clearly, $d(a) = d(b) = 0$. By the rules about combining continuous functions, $d(x)$ is continuous on $[a,b]$ and by rules for combining differentiable functions, $d(x)$ is differentiable on $(a,b)$. This means that we can apply Rolle’s Theorem to $d(x)$ so that we have a $c \in (a,b)$ such that $d’(c) = 0$. That is, there is a $c$ such that $d’(c) = f’(c) - \frac{f(a) - f(b)}{a - b} = 0$. Rearranging a bit, we see that at that $c$, we have $f’(c) = \frac{f(a) - f(b)}{a - b}$, proving the theorem. &lt;strong>QED.&lt;/strong>&lt;/p>
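&lt;p>The proof is constructive enough to imitate numerically. A small sketch (Python; the sample function and the bisection scheme are my own choices) that hunts for the $c$ the theorem promises by finding a root of $d’(x) = f’(x) - \frac{f(a) - f(b)}{a - b}$:&lt;/p>

```python
def f(x):
    return x ** 3

def fprime(x):
    return 3 * x ** 2

a, b = 0.0, 2.0
slope = (f(b) - f(a)) / (b - a)   # secant slope, here 4.0

def g(x):
    # g(c) = 0 exactly when fprime(c) equals the secant slope
    return fprime(x) - slope

# bisection: g(a) and g(b) have opposite signs, so a root lies between them
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if g(lo) * g(mid) > 0:
        lo = mid   # same sign: the root is in the upper half
    else:
        hi = mid   # sign change: the root is in the lower half
c = (lo + hi) / 2
print(round(c, 6))  # approaches 2/sqrt(3), about 1.1547
```

&lt;p>For $f(x) = x^3$ on $[0, 2]$ the secant slope is 4, and the bisection homes in on $c = 2/\sqrt{3}$.&lt;/p>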
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Several of the important results (e.g. L’Hospital’s Rule for solving limits) that you learn about later on in calculus and real analysis courses use the Mean Value Theorem as their driving force. I haven’t seen many theorems that are both simple to grasp conceptually and also fundamental building blocks of important areas of mathematics. I think the derivative is a phenomenal example of the power and necessity of good definitions and the type of ingenuity that appears all across mathematics, albeit more subtly sometimes.&lt;/p>
&lt;p>In this post I thought I’d write about two fun mathematical puzzles I came across recently (in Martin Gardner’s book
&lt;em>Mathematical Puzzles&lt;/em>). Neither requires much mathematical sophistication, but they are both good examples of how the
ability to think logically and model a problem technically are important problem-solving tools.&lt;/p>
&lt;h2 id="round-trip">Round trip&lt;/h2>
&lt;h3 id="problem">Problem&lt;/h3>
&lt;p>An airline runs a round trip from city A to city B and back. The plane travels at a single constant speed. During one trip,
there is no wind blowing during either leg. During a second trip, there is a constant wind that blows from A to B during
both legs of the trip. Is the first trip longer than, shorter than, or the same length as, the second trip?
(Take a minute to think about it before reading the solution.)&lt;/p>
&lt;h3 id="solution">Solution&lt;/h3>
&lt;p>Let $d$ be the distance from A to B. If $r \cdot t = d$ (rate x time = distance), then $t = d/r$. What we are going to do
is write equations for $t$ in each of the wind and no-wind cases and see if we can determine the relationship between them.
In the no-wind case, if the plane’s constant speed is given by $r$, then $t_1 = 2d / r$.
In the wind case, let’s say the wind speed is given by some $w$ with $0 &lt; w &lt; r$ (if $w \geq r$, the plane could never complete
the return leg). Then the plane travels at a rate of $r + w$ on the way
from A to B and at a rate of $r - w$ on the way back. Thus, the total round trip time is given by
&lt;/p>
$$
\begin{align*}
t_2 &amp;= \frac{d}{r + w} + \frac{d}{r - w} \\\\
&amp;= \frac{d(r - w) + d(r + w)}{r^2 - w^2} \\\\
&amp;= \frac{2dr}{r^2 - w^2}
\end{align*}
$$&lt;p>Notice that because $w > 0$, $w^2 > 0$, so that
&lt;/p>
$$
t_2 = \frac{2dr}{r^2 - w^2} > \frac{2dr}{r^2} = \frac{2d}{r} = t_1,
$$&lt;p>
showing us that the trip with wind takes longer.&lt;/p>
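&lt;p>A quick numeric sanity check of the algebra (Python; the sample distance and speeds are made up):&lt;/p>

```python
def trip_times(d, r, w):
    # round-trip times: no wind vs. a constant wind w blowing from A to B
    t_no_wind = 2 * d / r
    t_wind = d / (r + w) + d / (r - w)
    return t_no_wind, t_wind

t1, t2 = trip_times(d=1000, r=500, w=100)
print(round(t1, 4), round(t2, 4))  # prints 4.0 4.1667
```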
&lt;h2 id="cornerless-chessboard">Cornerless chessboard&lt;/h2>
&lt;h3 id="problem-1">Problem&lt;/h3>
&lt;p>You have an 8x8 chessboard and 32 2x1 dominoes. As is hopefully clear, you can cover all 64 squares on the chessboard with the 32 dominoes. Now suppose I remove two opposite corners from the board and take away one domino. Can you cover the 62 remaining squares with your 31 remaining dominoes? If so, show how. If not, prove it.&lt;/p>
&lt;h3 id="solution-1">Solution&lt;/h3>
&lt;p>You cannot cover the remaining squares. To see why, the key observation is that each domino covers one black square and one white square. If you remove opposite corners, you are removing two squares of the same color. In order to cover what’s left, you would need to cover 30 black squares and 32 white squares, but per our observation, 31 dominoes can only cover 31 black squares and 31 white squares! Thus, covering the remaining squares with 31 dominoes is indeed impossible.&lt;/p>
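&lt;p>The counting argument is easy to verify programmatically. A small sketch (Python; the coordinates and coloring convention are my own) that colors the board and tallies what’s left after the corners are removed:&lt;/p>

```python
def color(row, col):
    # standard chessboard coloring: 0 and 1 alternate
    return (row + col) % 2

squares = [(r, c) for r in range(8) for c in range(8)]
# remove two opposite corners: (0, 0) and (7, 7) share a color
remaining = [s for s in squares if s != (0, 0) and s != (7, 7)]
counts = [0, 0]
for r, c in remaining:
    counts[color(r, c)] += 1
print(counts)  # prints [30, 32]
```

&lt;p>Thirty squares of one color and thirty-two of the other: since each domino covers one of each, no 31 dominoes can match that imbalance.&lt;/p>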
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>While delving into deep higher-level mathematics is certainly rewarding, it’s fun to pause every so often and play with some less-involved puzzles; I hope you’ve enjoyed :)&lt;/p></description></item><item><title>Fundamental theorem of arithmetic</title><link>https://www.jgindi.me/posts/2017-12-2-ftoa/</link><pubDate>Wed, 10 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-12-2-ftoa/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Often when I decide to write a post about some theorem or concept, I find the best candidates are those that are both deep and easy to explain. These are admittedly hard to come by, but upon doing a bit of review of some basic number theory (the study of properties of whole numbers), I stumbled across the Fundamental Theorem of Arithmetic (FToA) and thought that it was an almost perfect candidate.&lt;/p>
&lt;p>The FToA is about the atomic nature of prime numbers, which, for those unfamiliar, are numbers whose only divisors are themselves and 1. The FToA basically tells us that each whole number is made up of some unique product of primes. There are proofs littered across mathematics that make use of either or both of the existence of such a decomposition and its uniqueness. For such a useful theorem, the proof is quite accessible and I thought it was worth writing about, so here we go.&lt;/p>
&lt;h2 id="proving-it">Proving it&lt;/h2>
&lt;p>The theorem can be stated as follows:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Theorem:&lt;/strong> Every positive whole number $n > 1$ can be written as a unique product of prime numbers.&lt;/p>&lt;/blockquote>
&lt;p>&lt;strong>Proof:&lt;/strong> The proof has 2 parts. We will first show that the decomposition exists and then we will show that it’s unique.
For existence, we will use induction. The base case, when $n = 2$, is trivial. 2 is the product of… well… 2. So now we
assume that $n > 2$ and that every number $1 &lt; k &lt; n$ has such a decomposition. If $n$ is prime, we’ve succeeded (same
logic we used for the base case). If $n$ is composite, then we can write $n = ab$ with $a$ and $b$ both strictly smaller
than $n$. By the induction hypothesis, both $a$ and $b$ have prime factor decompositions, so $n$ does as well.&lt;/p>
&lt;p>(If you aren’t familiar with induction, what we’ve done is shown that in the very smallest case, we have what we want.
We’ve also shown that if what we want to prove holds for $2..n-1$, it also holds for $n$. Thus, if it holds for 2, it
holds for 3. If it holds for 2 and 3, it holds for 4. If it holds for 2, 3 and 4, it holds for 5. Continuing this way
forever, we see that every possible $n$ has the property we want.)&lt;/p>
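&lt;p>To make the existence half concrete, here is a small Python sketch (mine, not part of the theorem) that repeatedly peels off the smallest divisor of $n$, which is necessarily prime, exactly as the inductive step suggests:&lt;/p>

```python
def prime_factors(n):
    """Return the prime factorization of n > 1 as a sorted list.

    The smallest divisor d > 1 of n is always prime, so we peel
    off copies of each candidate d and move on, mirroring the
    inductive step n = a * b of the existence proof.
    """
    factors = []
    d = 2
    while n >= d * d:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:          # whatever survives is itself prime
        factors.append(n)
    return factors

print(prime_factors(360))  # [2, 2, 2, 3, 3, 5], i.e. 2^3 * 3^2 * 5
```

&lt;p>Uniqueness shows up here as the fact that the sorted list is a canonical form: no matter how you discover the divisors, you end up with the same multiset of primes.&lt;/p>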
&lt;p>Now, for uniqueness. The way we typically show uniqueness in math is by supposing that there are 2 distinct versions of
whatever it is we think is unique and showing that they must actually be the same. This is the technique we employ here.
Suppose that we could write $n$ in two ways. That is, suppose we could validly write both $n = p_1^{e_1}p_2^{e_2}\dots
p_k^{e_k}$ and $n = q_1^{f_1}q_2^{f_2}\dots q_m^{f_m}$ where the $p_i$ (distinct from one another) and $q_j$ (distinct from
one another) are prime and the exponents are all positive. Notice that the first factorization has $k$ primes, the second
has $m$ and that the exponents are not necessarily all the same (yet). We are truly assuming that we have 2
factorizations that are, at least initially, potentially completely different from one another. If we can show that $k = m$, that $p_i = q_i$, and that $e_i = f_i$ for each $i$, then we’ve accomplished our objective.&lt;/p>
&lt;p>We can assume, without loss of generality, that the $p_i$ are in increasing order (if they’re not, we can relabel them so
that they are without affecting any part of the proof). Let’s look at $p_1$. It must divide one of the $q_j$ (this stems
from the fact that if a prime divides a product of numbers, it must divide at least one of the numbers — this is not
hard to prove using induction… try it?). Let’s say that $p_1|q_j$ for some $j$. Reorder the $q_j$ so that $p_1|q_1$.
Because both $p_1$ and $q_1$ are prime, this means that $p_1 = q_1$. So divide each factorization by $p_1$ and $q_1$
respectively and repeat this process until you run out of primes in one of the decompositions. If one of the
factorizations runs out before the other, then we will have written 1 as a product of primes greater than 1, which is
impossible. They must thus run out at the same time, whence $k = m$, $e_i = f_i$ for each $i$ and our two
factorizations must have been one and the same. &lt;strong>QED.&lt;/strong>&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Fundamental theorems abound all over mathematics. There are fundamental theorems of arithmetic, algebra, calculus, cyclic groups, linear algebra and others. This one, though, really gets at the very makeup of a mathematical entity that all of us understand, at least on a basic level: the positive whole numbers. Cool, no?&lt;/p></description></item><item><title>Euler's Identity</title><link>https://www.jgindi.me/posts/2018-01-01-euler-identity/</link><pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2018-01-01-euler-identity/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post I want to show a few different ways of proving that $e^{i\theta} = \cos\theta + i\sin\theta$. It’s a cute illustration of how it’s often possible and rather cool to look at and solve problems in different ways.&lt;/p>
&lt;h2 id="approach-1">Approach 1&lt;/h2>
&lt;p>The first technique is one I encountered toward the end of my tenure as a Calc 2 TA last semester as I was going over Taylor and MacLaurin series with my students. The MacLaurin series for $\cos\theta$ and $\sin\theta$ are
&lt;/p>
$$
\begin{align*}
\cos\theta &amp;= 1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \dots\\
\sin\theta &amp;= \theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \dots,
\end{align*}
$$&lt;p>
so
&lt;/p>
$$i\sin\theta = i\theta - \frac{i\theta^3}{3!} + \frac{i\theta^5}{5!} - \dots$$&lt;p>Now, let’s look at the MacLaurin series for $e^{i\theta}$. Because
&lt;/p>
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots, $$&lt;p>
we have
&lt;/p>
$$
\begin{align*}
e^{i\theta} &amp;= 1 + i\theta + \frac{i^2\theta^2}{2!} + \frac{i^3\theta^3}{3!} + \dots\\
&amp;= 1 + i\theta - \frac{\theta^2}{2!} - \frac{i\theta^3}{3!} + \dots
\end{align*}
$$&lt;p>
which, miraculously, is exactly what we get if we add together the terms of $\cos\theta$ and $i\sin\theta$.&lt;/p>
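&lt;p>We can sanity-check the series manipulation numerically. Below is a short Python snippet (my own illustration, not part of the proof) comparing a truncated MacLaurin series for $e^{i\theta}$ against $\cos\theta + i\sin\theta$:&lt;/p>

```python
import cmath
from math import factorial

def exp_i_maclaurin(theta, terms=30):
    # Partial sum of 1 + (i*theta) + (i*theta)^2/2! + (i*theta)^3/3! + ...
    return sum((1j * theta) ** k / factorial(k) for k in range(terms))

theta = 1.2
series = exp_i_maclaurin(theta)
direct = cmath.cos(theta) + 1j * cmath.sin(theta)
assert 1e-12 > abs(series - direct)  # the two agree to machine precision
```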
&lt;h2 id="approach-2">Approach 2&lt;/h2>
&lt;p>The next approach uses techniques from a first course in calculus. We first observe that if $e^{i\theta}$ is going to be the same as $\cos\theta + i\sin\theta$, then we should have
$\frac{\cos\theta + i\sin\theta}{e^{i\theta}} = (\cos\theta + i\sin\theta)e^{-i\theta} = 1$. We will show that the second
equality holds.&lt;/p>
&lt;p>To do this, first define
$f(\theta) = (\cos\theta + i\sin\theta)e^{-i\theta}$. Next, we take the (rather annoying) derivative of $f(\theta)$
&lt;/p>
$$
\begin{align*}
f’(\theta) &amp;= e^{-i\theta}(-\sin\theta + i\cos\theta) - ie^{-i\theta}(\cos\theta + i\sin\theta)\\
&amp;= -e^{-i\theta}\sin\theta + ie^{-i\theta}\cos\theta - ie^{-i\theta}\cos\theta + e^{-i\theta}\sin\theta\\
&amp;= 0.
\end{align*}
$$&lt;p>
If $f’(\theta) = 0$ at all values of $\theta$, $f$ is constant! Which constant? Let’s plug in $\theta = 0$ and find out!
&lt;/p>
$$f(0) = (\cos 0 + i\sin 0) e^{-0i} = (1 + 0)(1) = 1$$&lt;p>
If $f$ takes the value 1 when $\theta = 0$ and $f$ is constant, it must take the value 1 everywhere. To sum up, we have
&lt;/p>
$$f(\theta) = (\cos\theta + i\sin\theta)e^{-i\theta} = 1.$$&lt;p>
Rearranging the second equality by cross multiplying, we see that
Euler’s identity holds.&lt;/p>
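&lt;p>Though not a proof, it’s reassuring to watch $f$ sit at the constant value 1. A quick Python check (my own, using the standard cmath module) samples $f(\theta) = (\cos\theta + i\sin\theta)e^{-i\theta}$ at a few points:&lt;/p>

```python
import cmath

def f(theta):
    # f(theta) = (cos(theta) + i*sin(theta)) * e^(-i*theta)
    return (cmath.cos(theta) + 1j * cmath.sin(theta)) * cmath.exp(-1j * theta)

# f has zero derivative, so it should equal f(0) = 1 everywhere we look
for theta in [0.0, 0.5, 1.0, 2.0, 3.14159]:
    assert 1e-12 > abs(f(theta) - 1)
```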
&lt;h2 id="approach-3">Approach 3&lt;/h2>
&lt;p>The last technique is my favorite. It uses a bit of linear algebra in concert with differential equations to produce what I think is the most illuminating proof of Euler’s identity. First, consider the differential equation
&lt;/p>
$$f’’ = -f.$$&lt;p>
A differential equation is a kind of functional equation.
Rather than trying to find the value of a real-valued variable,
we are trying to find a function whose derivatives satisfy a given relationship. In this case, we want to find a function $f$ whose
second derivative is the negative of $f$.&lt;/p>
&lt;p>We first note that $\cos\theta$ and $\sin\theta$ are both solutions to this equation:
&lt;/p>
$$
\begin{align*}
(\sin\theta)’’ &amp;= -\sin\theta \\
(\cos\theta)’’ &amp;= -\cos\theta.
\end{align*}
$$&lt;p>
Because our differential equation involves second derivatives, its
solutions are somewhat analogous to those of a quadratic equation:
in a sense, there are two! Formally, we say that the solution
space is a vector space of dimension 2. If $\sin\theta$ and
$\cos\theta$ are linearly independent solutions, they form a basis
of the solution space, which means that every solution to our
differential equation can be written in the form
$a\cos\theta + b\sin\theta$ for some constants $a,b$.&lt;/p>
&lt;p>To see that $\sin\theta$ and $\cos\theta$ are indeed linearly
independent, suppose that for all $\theta \in [0,2\pi]$,
$a\cos\theta + b\sin\theta = 0$. Let’s pick a particular $\theta$
value in this interval. If $\theta = \pi/2$, then we have
$a \cdot 0 + b \cdot 1 = b = 0$ (where the $=0$ at the end is
because we supposed that $a\cos\theta + b\sin\theta = 0$ for all
$\theta \in [0,2\pi]$). If we pick another, say $\theta = 0$, we
have $a \cdot 1 + b \cdot 0 = a = 0.$&lt;/p>
&lt;p>So far, we&amp;rsquo;ve shown that if $a\cos\theta + b\sin\theta = 0$, then
we know that $a = b = 0$. This constitutes a proof that
$\sin\theta$ and $\cos\theta$ are linearly independent. Because
we said the solution space is of dimension 2, they are a basis for
the solution space to our original equation.&lt;/p>
&lt;p>We can separately observe that $e^{i\theta}$ is also a solution to our equation because
&lt;/p>
$$
\begin{align*}
(e^{i\theta})’ &amp;= ie^{i\theta}\\
(e^{i\theta})’’ &amp;= -e^{i\theta}.
\end{align*}
$$&lt;p>Because $\sin\theta$ and $\cos\theta$ form a basis, we know that
&lt;/p>
$$e^{i\theta} = a\cos\theta + b\sin\theta$$&lt;p>
for some yet unknown constants $a,b$. Now it just remains to figure out what $a$ and $b$ are. (Can you see what they should be?)&lt;/p>
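&lt;p>As a numerical aside (my own check, using only the standard cmath module): finite differences confirm both that $e^{i\theta}$ satisfies $f’’ = -f$ and that the initial values $f(0) = 1$ and $f’(0) = i$ are the data that pin down $a$ and $b$:&lt;/p>

```python
import cmath

def g(theta):
    return cmath.exp(1j * theta)

h = 1e-5
theta = 0.7
# central differences: g'(t)  ~ (g(t+h) - g(t-h)) / (2h)
#                      g''(t) ~ (g(t+h) - 2*g(t) + g(t-h)) / h^2
second = (g(theta + h) - 2 * g(theta) + g(theta - h)) / h**2
assert 1e-4 > abs(second + g(theta))   # g'' = -g, so g solves our equation

first_at_0 = (g(h) - g(-h)) / (2 * h)
assert 1e-12 > abs(g(0) - 1)           # g(0) = 1 forces a = 1
assert 1e-8 > abs(first_at_0 - 1j)     # g'(0) = i forces b = i
```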
&lt;p>To find $a$ and $b$, we note that if $f(\theta) = e^{i\theta}$,
then $f(0) = 1$ and $f’(0) = i$. Using the first condition, we have
&lt;/p>
$$1 = a \cdot 1 + b \cdot 0 = a.$$&lt;p>
Using the second (taking the derivative of both sides before plugging 0 in), we have
&lt;/p>
$$i = -a\sin 0 + b\cos 0 = 0 + b = b. $$&lt;p>
Putting these both together, we have
$e^{i\theta} = \cos\theta + i\sin\theta,$
which is what we wanted.&lt;/p></description></item><item><title>The Cantor set</title><link>https://www.jgindi.me/posts/2017-10-27-cantor-set/</link><pubDate>Fri, 27 Oct 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-10-27-cantor-set/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post, I want to talk about a mathematical construct I read about last night that is just downright fascinating. It’s a great example of the way math can help us make sense of the otherwise-opaque. I give you: the Cantor Set!&lt;/p>
&lt;h2 id="the-cantor-set">The Cantor set&lt;/h2>
&lt;p>Let $C_0$ be the interval $[0,1]$. Now remove the middle third of the interval to obtain $C_1 = [0, \frac{1}{3}] \cup [\frac{2}{3}, 1]$. Next, remove the middle thirds of the intervals leftover in $C_1$ to construct
$C_2 = ([0, \frac{1}{9}] \cup [\frac{2}{9}, \frac{1}{3}]) \cup ([\frac{2}{3}, \frac{7}{9}] \cup [\frac{8}{9}, 1])$.
Continue iteratively removing middle thirds from the remaining intervals. Doing this, we get a sequence of unions of closed intervals $C_0, C_1, C_2, \dots$. The Cantor Set $C$ is defined as the intersection of the $C_i$; mathematically, we write
$C = \cap_{i=0}^\infty C_i$.
Visually, the $C_i$ look like&lt;/p>
&lt;p>&lt;img src="https://www.jgindi.me/posts/cantor-set/cantor_set.jpg" alt="">&lt;/p>
&lt;p>where the topmost line is $C_0$, the second line is $C_1$, and so on.&lt;/p>
&lt;h2 id="at-infinity">At infinity&lt;/h2>
&lt;p>The rest of this post will be spent trying to understand how $C$ is composed. What is left in it after an infinite sequence
of cuts?&lt;/p>
&lt;p>First notice that 0 and 1 are not deleted during any stage of the process. Generalizing this point, we can see that if some
value in $[0,1]$ is an endpoint of some interval at some point during our chain of cuts, it never gets removed. For
example, when we delete the middle thirds of the left and right parts of $C_1$ to obtain $C_2$, observe that we don’t touch
$0, \frac{1}{3}, \frac{2}{3}$, or $1$. Formally, we would argue that if $x$ is an endpoint of $C_i$, we know two things:&lt;/p>
&lt;ol>
&lt;li>$x \in C_k$ for $k \leq i$.&lt;/li>
&lt;li>$x$ is not removed during the construction of any of the other $C_k$ for $k > i$.&lt;/li>
&lt;/ol>
&lt;p>Thus, $x$ is in every one of the $C_i$, so by the definition of $C$ (as the intersection of the $C_i$), $x \in C$.
Is there anything else in $C$? If the only numbers left are endpoints of intervals, then $C$ would be a subset of
$\mathbb{Q}$ and we would thus conclude that $C$ is countable.&lt;/p>
&lt;p>More on this in a minute. One way we might try to convince ourselves that there is indeed not much else in $C$ besides
“endpoints” is to think about how much of the interval $[0,1]$ is left once we’ve made all of our cuts. To do this, we just
need to think about how much we delete on each pass. On the first pass, we delete one interval of length $1/3$ ($1/3^1$).
On the second, we delete 2 intervals of size $1/9$ ($1/3^2$). On the third, we delete 4 intervals of size $1/27$ ($1/3^3$).
Generalizing this pattern, we see that on the $i$th iteration, we cut $2^{i-1}$ intervals, each of size $1/3^i$. To
count up how much length we cut, we just need the sum of
&lt;/p>
$$\frac{1}{3} + 2\biggl(\frac{1}{9}\biggr) + 4\biggl(\frac{1}{27}\biggr) + \dots + 2^{i-1}\biggl(\frac{1}{3^i}\biggr) + \dots = \frac{1}{3}\sum_{i = 1}^\infty \biggl(\frac{2}{3}\biggr)^{i-1}$$&lt;p>The series is geometric with ratio less than 1, so the sum evaluates to
&lt;/p>
$$\frac{1}{3}\biggr(\frac{1}{1 - \frac{2}{3}}\biggr) = \frac{1}{3}\cdot 3 = 1.$$&lt;p>
But that’s kinda odd… we started with an interval of length 1, and have cut out… all of it? (Mathematically, we say that
$C$ has zero length.)&lt;/p>
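&lt;p>You can watch the deleted length approach 1. The partial sums of the series above have a closed form, $1 - (2/3)^k$ after $k$ passes (a standard geometric-series fact), which the following sketch of mine verifies with exact fractions:&lt;/p>

```python
from fractions import Fraction

def removed_length(passes):
    # sum of 2^(i-1) intervals of length 1/3^i for i = 1..passes
    total = Fraction(0)
    for i in range(1, passes + 1):
        total += 2 ** (i - 1) * Fraction(1, 3 ** i)
    return total

for k in [1, 2, 3, 10, 40]:
    assert removed_length(k) == 1 - Fraction(2, 3) ** k   # tends to 1
```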
&lt;h2 id="so-is-it-countable">So is it countable?&lt;/h2>
&lt;p>At this point, you (as I did) probably thought that the buck stopped here. As expected, $C$ seems sparse and small, probably
even countable. But as with many things in set theory, there is a bit more depth yet to investigate. As our final act, we’re
going to show that $C$ is not merely infinite: it’s actually uncountable!&lt;/p>
&lt;p>To do this, we are going to take a preliminary result for granted, namely that the set of all infinite sequences of 0s and
1s is uncountable. (If you’re feeling adventurous, take a stab at proving this yourself. If you’re feeling a little less
adventurous but you’re still in the mood for a challenge, a hint is that the proof is a diagonalization argument much like
Cantor’s proof that the real numbers are uncountable.)&lt;/p>
&lt;h2 id="the-one-to-one-correspondence">The one-to-one correspondence&lt;/h2>
&lt;p>We now construct a one-to-one correspondence between sequences of 0s and 1s and elements of $C$. For each element $c \in
C$, define $a_i$ — $i \geq 1$ — to be 0 if $c$ falls in the left part of $C_i$, and 1 if it falls in the right part. (Note:
if $c$ is in the left part of $C_{i-1}$, the “left” and “right” parts of $C_i$ refer to the left and right parts that
result when we cut out the middle third of the left part of $C_{i-1}$.)&lt;/p>
&lt;p>Read the sentence in parentheses over again. It is written in unfortunately confusing language, but it’s crucial to
understanding the construction.&lt;/p>
&lt;p>To see that this is actually a one-to-one correspondence, note that given a sequence of 0s and 1s, we can “follow” the sequence to pinpoint the exact, unambiguous element of $C$ that the particular sequence represents. Conversely, the construction of the sequence from two paragraphs ago gives us a way to take an element of $C$ and come up with a unique sequence by looking at exactly where the element falls with respect to each of the $C_i$.
Given that the set of infinite sequences of 0s and 1s is uncountable, this means that $C$ is actually uncountable too!&lt;/p>
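&lt;p>For the curious, here is one way (a sketch of mine, not from the original argument) to “follow” a finite prefix of a 0/1 sequence down the construction: each 0 steps into the left third, each 1 into the right, and the left endpoint of the interval reached is a genuine element of $C$ whose base-3 digits are $2 a_i$:&lt;/p>

```python
from fractions import Fraction

def follow(bits):
    """Follow bits (0 = left third, 1 = right third) down the
    construction; return the left endpoint of the interval reached.
    Its ternary expansion uses digit 2*bit in position i."""
    return sum(Fraction(2 * b, 3 ** i) for i, b in enumerate(bits, start=1))

assert follow([0, 0]) == 0
assert follow([1, 1]) == Fraction(8, 9)       # 2/3 + 2/9
assert follow([1, 0, 1]) == Fraction(20, 27)  # 2/3 + 0 + 2/27
```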
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>When I first read this, my mind was blown. Isn’t it amazing?! We’ve somehow come up with a way to remove all of the &lt;em>length&lt;/em> from an interval without diminishing its &lt;em>size&lt;/em> in the least! By approaching our original inquiry rigorously, we took $[0,1]$, which has length 1, removed all of its length via our construction of $C$, and yet somehow didn’t affect its cardinality.&lt;/p>
&lt;p>Infinities don’t always play nice, but that’s why we love them.&lt;/p></description></item><item><title>The Alternating Series test</title><link>https://www.jgindi.me/posts/2017-09-27-alternating/</link><pubDate>Wed, 27 Sep 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-09-27-alternating/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>While I was in college, I spent a few semesters TAing Calc II (Calc BC if you do it in high school). Both when I took the class and when I TAed it, I found the part of the course devoted to infinite series the most interesting by far. It was (and still is) amazing to me that you can — informally — add together an infinite number of terms and get a finite result. In fact, until the 19th century, the above was considered paradoxical and incorrect. To help determine whether or not different series converge, mathematicians developed a suite of tests whose statements are simple enough to be taught to introductory calculus students but whose power cannot be overstated.&lt;/p>
&lt;p>In this post, we will prove that the “alternating series test” is valid. While the explanation might be a little bit involved, I hope to include all of the necessary background here. Hopefully, all you’ll need to follow this post is some patience and willingness to challenge yourself a little; this proof certainly challenged me, but I think the effort was well worth it.&lt;/p>
&lt;p>If you are unfamiliar with what an infinite series is, see &lt;a href="https://en.wikipedia.org/wiki/Series_(mathematics)">Series (mathematics) - Wikipedia&lt;/a>.&lt;/p>
&lt;h2 id="the-problem">The problem&lt;/h2>
&lt;p>Before we start, we should understand the problem we are trying to solve. Let’s say we have some infinite series, and let’s write it down as
&lt;/p>
$$\sum_{n = 1}^\infty a_n = a_1 + a_2 + a_3 + \dots.$$&lt;p>
The above series may converge or it may not. We can modify this rather plain series by adding together the terms of the
same underlying sequence with alternating signs, forming what is known as an &lt;strong>alternating series&lt;/strong>. Symbolically, an
alternating series has the form
&lt;/p>
$$\sum_{n=1}^\infty (-1)^{n-1}a_n = a_1 - a_2 + a_3 - \dots.$$&lt;p>
(Notice that the alternation comes from the fact that -1 to an even power is 1 and -1 to an odd power is -1.)
If the non-alternating series converges to a finite sum, then the alternating series clearly does as well.* If not,
though, would introducing alternation maybe force convergence? If so, we say that $\sum (-1)^{n-1}a_n$ &lt;strong>converges conditionally&lt;/strong>;
if not, oh well… some things just aren’t meant to be.&lt;/p>
&lt;h2 id="preliminaries">Preliminaries&lt;/h2>
&lt;p>More specifically, the AST is concerned with what conditions we need to place on our underlying sequence, $a_n$, so that
alternation implies convergence. It turns out that whether an alternating series converges is quite easy to check; the
only two things we need to verify are:&lt;/p>
&lt;ol>
&lt;li>The terms of the original (non-alternating) sequence are decreasing. Symbolically, we want $a_1 \geq a_2 \geq a_3
\geq \dots$&lt;/li>
&lt;li>$a_n \to 0$.&lt;/li>
&lt;/ol>
&lt;p>Our aim for the rest of this post is to prove that if $a_n$ satisfies (1) and (2), then $\sum (-1)^{n-1} a_n$ converges
to a finite sum. In order to do this, though, we need a bit of machinery from real analysis, which we discuss next.&lt;/p>
&lt;h2 id="completeness-and-the-monotone-convergence-theorem">Completeness and the monotone convergence theorem&lt;/h2>
&lt;p>Real analysis (in an oversimplified sense) is the study of properties of the real numbers and functions of a real
variable (things like continuity, differentiability, integrability and some other related stuff). The first thing one
often does in a real analysis class is to formally discuss what the real numbers are, why we need them, and how to
construct them. To go into all of that here would take us pretty far afield, but in short, the real numbers is the first
example of a truly continuous set in the sense that there are no holes. (Although the rational numbers are dense, there
are holes where irrational numbers — e.g. $\sqrt{2}$ — should be. Integers and natural numbers clearly have holes, e.g.
between 1 and 2.)&lt;/p>
&lt;p>One of the challenges students (myself included) typically face when first trying to wrap their heads around the
above is how to rigorously define this idea that there are no holes (mathematically called
&lt;strong>completeness&lt;/strong>). Most commonly, the definition you settle on is that every nonempty set of real numbers that has an upper
bound has a &lt;em>least&lt;/em> upper bound (and, symmetrically, every nonempty set bounded below has a &lt;em>greatest&lt;/em> lower bound).&lt;/p>
&lt;p>A textbook I’ve been going through actually shows that in addition to the above characterization — often referred to as
the Axiom of Completeness — there are (at least) 4 other &lt;em>equivalent&lt;/em> ways to characterize completeness, one of which we
will use in the proof that we’re going to attempt below. It’s called the Monotone Convergence Theorem (MCT), and it
states: any sequence (of real numbers) that is (1) bounded and (2) monotone converges. (A sequence is said to be
&lt;strong>bounded&lt;/strong> if there is a number $M$ such that all terms of the sequence are contained in the interval $[-M, M]$. A
sequence is &lt;strong>monotone increasing&lt;/strong> if every term is greater than or equal to the preceding term and &lt;strong>monotone decreasing&lt;/strong> if each term is less than or equal to the preceding term; if a sequence is monotone increasing or monotone
decreasing, we say it’s &lt;strong>monotone&lt;/strong>, as you might expect.)&lt;/p>
&lt;p>The MCT is pretty powerful. Boundedness and monotonicity are often intuitive properties that we can hand-wavily infer
about a sequence we are examining and with MCT, we can transform those properties into what is often times the holy grail
of sequence (and series) analysis: convergence!&lt;/p>
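&lt;p>A classic illustration of the MCT in action (my example, not part of the proof to come): the sequence $x_0 = 0$, $x_{n+1} = \sqrt{2 + x_n}$ is increasing and bounded above by 2, so the MCT guarantees it converges, and a quick Python loop shows it settling at its fixed point:&lt;/p>

```python
from math import sqrt

x = 0.0
for _ in range(60):
    x_next = sqrt(2 + x)
    assert x_next >= x     # monotone increasing
    assert 2 >= x_next     # bounded above by 2
    x = x_next

assert 1e-9 > abs(x - 2)   # the limit L solves L = sqrt(2 + L), i.e. L = 2
```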
&lt;p>Another technicality we need to address before we tackle our main result is that we need to clarify what we mean when we
write $\sum_{n=1}^\infty a_n$. What does it &lt;em>actually&lt;/em> mean for $\sum_{n=1}^\infty a_n$ to “converge” to some finite
value $S$?&lt;/p>
&lt;p>When we say that the sum converges, what we mean is that as we add more and more terms, we get closer and closer to $S$.
In other words, to know whether a series converges, we need to know whether or not what we call the sequence of &lt;strong>partial sums&lt;/strong> of $a_n$ — $a_1, a_1 + a_2, a_1 + a_2 + a_3, \dots$ — converges. Technically speaking, let $s_n$ denote the $n$th
partial sum of the $a_n$; that is, $s_n = a_1 + a_2 + \dots + a_n$. Then saying that $S = \sum_{n=1}^\infty a_n$ is the
same as saying $S = \lim_{n \to \infty} s_n$. That last sentence is just formalism; what is really important here is that
to show that some series converges to a number, we need only show that the partial sums of the sequence’s terms converge.
With this in mind, we’re ready for the main result.&lt;/p>
&lt;p>(Before continuing, make sure that you understand the main points from the previous two paragraphs.)&lt;/p>
&lt;p>Before writing a proof, I often find it helpful to have context for the way the proof is going to unfold so that as I’m writing the proof, I’m able to remember where I am and how it’s supposed to help me get where I want to go. The statement we are trying to prove here is:&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Alternating Series Test:&lt;/strong> If $a_n$ is decreasing and $a_n \to 0$, then $\sum (-1)^{n-1} a_n$ converges.&lt;/p>&lt;/blockquote>
&lt;h2 id="proof-sketch">Proof sketch&lt;/h2>
&lt;p>The proof will proceed by the following steps:&lt;/p>
&lt;ol>
&lt;li>We will consider two subsequences of the sequence of partial sums: the first will be the subsequence of partial sums that add up an odd number of terms and the second will be the subsequence of partial sums that add up even numbers of terms. We will call the “odd” one $s_{2n+1}$ and the “even” one $s_{2n}$.&lt;/li>
&lt;li>We will show that each of the subsequences is bounded.&lt;/li>
&lt;li>We will show that each of the subsequences is monotone.&lt;/li>
&lt;li>(2) and (3) imply that both converge by MCT, say to limits $L_1$ and $L_2$ respectively.&lt;/li>
&lt;li>We will show that $L_1 = L_2$ and will call the shared limit $L$.&lt;/li>
&lt;li>We note that the “whole” sequence of partial sums can be made up by interleaving terms of the even and odd subsequences.&lt;/li>
&lt;li>If the two subsequences converge to L and we can interleave them to form the original sequence, then it also must converge to $L$. The original sequence was the equivalent partial sum expression of our alternating series, so the proof is complete.&lt;/li>
&lt;/ol>
&lt;h2 id="proof">Proof&lt;/h2>
&lt;p>Now, for the proof.&lt;/p>
&lt;p>The odd subsequence looks like:
&lt;/p>
$$a_1, a_1 - a_2 + a_3, a_1 - a_2 + a_3 - a_4 + a_5, \dots$$&lt;p>
Recall that the $a_n$ are decreasing. This means that subtracting $a_2$ and then adding back a little less than $a_2$ to
$a_1$ is going to leave you slightly short of $a_1$. When you subtract $a_4$ from and then add $a_5$ to $a_1 - a_2 +
a_3$, you’re going to end up slightly short of $a_1 - a_2 + a_3$. Extending this logic, we arrive at two conclusions.
First, we see that the odd partial sums are monotonically decreasing because with every pair of $a_k$ that we tack on to
obtain a successive term, we subtract some amount and then add back a bit less than we got rid of. Second, we see that
once we depart from $a_1$ (and start our adding and subtracting madness), we never quite make it back. In other words,
we can bound the odd partial sums by $a_1$.&lt;/p>
&lt;p>(There are formal, symbolic ways of representing all of this, but if you understand the above line of reasoning, you’ve
understood the gist, IMO.)&lt;/p>
&lt;p>Next, we note that for each $k$, the partial sum $s_{2k} \leq s_{2k+1}$ because the last term of every odd partial sum adds some small positive amount to the previous sum, which was made up of an even number of terms. For example, when $k = 2$, we have
&lt;/p>
$$s_4 = s_{2 \times 2} = a_1 - a_2 + a_3 - a_4 \leq a_1 - a_2 + a_3 - a_4 + a_5 = s_{2 \times 2 + 1} = s_5.$$&lt;p>
Since the odd partial sums are bounded above by $a_1$, the even partial sums must be bounded too!&lt;/p>
&lt;p>Further, notice that the even partial sums monotonically increase. We can see this by observing that each time we tack on a
pair of $a_i$, we add a little bit and then subtract a little bit less. When we go from
$s_2 = a_1 - a_2$ to $s_4 = a_1 - a_2 + a_3 - a_4$, we take $a_1 - a_2$ and change it slightly by adding $a_3$ and then subtracting a little bit less than $a_3$ (namely, $a_4$).&lt;/p>
&lt;p>But wait! We now have established that both the odd and even sums are monotone and bounded, so they must both converge!
Let’s call the limit of the sequence of odd partial sums $L_1$ and the limit of the even partial sums $L_2$. To prove
that $L_1$ and $L_2$ are the same, all we need to do is show that the difference between them is 0. To see this, we
simply observe that
&lt;/p>
$$\lim_{k \to \infty} s_{2k+1} - s_{2k} = \lim_{k \to \infty} a_{2k+1} = 0.$$&lt;p>
Thus $L_1 = L_2$; we will henceforth refer to the limit as $L$.&lt;/p>
&lt;p>We can reconstruct the sequence of partial sums representing our original series by interleaving terms of the odd and
even subsequences. Because both subsequences tend to $L$, we can conclude that the sequence constructed from the
interleaved subsequences also tends to $L$. This completes the proof, as we’ve provided the desired finite limit for our
alternating series’ partial sums.&lt;/p>
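&lt;p>The whole argument is easy to watch numerically. Taking $a_n = 1/n$ (my choice of example: the alternating harmonic series, whose sum is $\ln 2$), the odd partial sums decrease, the even ones increase, and the two squeeze the limit between them:&lt;/p>

```python
from math import log

def partial_sums(n_terms):
    # s_n for the alternating harmonic series 1 - 1/2 + 1/3 - ...
    s, out = 0.0, []
    for n in range(1, n_terms + 1):
        s += (-1) ** (n - 1) / n
        out.append(s)
    return out

s = partial_sums(1000)
odd, even = s[0::2], s[1::2]   # s_1, s_3, s_5, ... and s_2, s_4, s_6, ...

assert all(odd[i] >= odd[i + 1] for i in range(len(odd) - 1))     # decreasing
assert all(even[i + 1] >= even[i] for i in range(len(even) - 1))  # increasing
assert log(2) > max(even) and min(odd) > log(2)                   # squeezed
assert 1e-2 > odd[-1] - even[-1]     # the gap a_{2k+1} shrinks toward 0
```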
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Sometimes we’re taught formulas and tests in class and are — rather unfortunately — not challenged to understand why
they work as well as they do. As I’ve spent a little bit of time going back through some of the things I took for
granted when I first encountered them, I’ve found that the stuff behind the curtain is often very interesting and
illuminating. In this particular case, I hope you have too.&lt;/p>
&lt;p>*If adding a bunch of positive (the same type of argument will work if some terms are negative) terms together gives some finite positive result, then subtracting some of the terms instead of adding them will certainly give a finite result as well. The sum of the alternating series is thus bounded above by the sum of the non-alternating terms.&lt;/p></description></item><item><title>Proving √2 is irrational</title><link>https://www.jgindi.me/posts/2017-07-26-sqrt-2/</link><pubDate>Wed, 26 Jul 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-07-26-sqrt-2/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When you first encounter number systems, the story usually goes something like this:&lt;/p>
&lt;ol>
&lt;li>There are obviously positive whole numbers. As Leopold Kronecker said: “God made the integers, the rest is the work of man.” (While it’s true that you can construct $\mathbb{N}$ set theoretically, we will take the natural numbers as given here.)&lt;/li>
&lt;li>In order to add the idea of additive inverses — which would imbue the natural numbers with richer algebraic structure — we construct $\mathbb{Z}$ (which includes the negative numbers).&lt;/li>
&lt;li>Next, we note that while multiplying members of $\mathbb{Z}$ leaves us inside $\mathbb{Z}$, division, the anti-multiplication, does not. To accommodate division and to augment $\mathbb{Z}$ to include the idea of multiplicative inverses (read: reciprocals), we construct the rational numbers (denoted $\mathbb{Q}$), i.e. the set of quotients of integers with the caveat that the denominator of said quotient must be nonzero. (Note: This is also probably the first example of an algebraic structure called a field that students encounter.)&lt;/li>
&lt;li>Then one of our professors observes that although $\mathbb{Q}$ is dense in a way that $\mathbb{Z}$ is not, it still has holes that are occupied by irrational numbers. To bring irrational numbers into the fold and finally construct a system known as the continuum (it has no holes), we construct $\mathbb{R}$, the real numbers from $\mathbb{Q}$.&lt;/li>
&lt;/ol>
&lt;h2 id="confusion">Confusion&lt;/h2>
&lt;p>During step 4, professors often introduce the irrational numbers by appealing to the mysterious $\sqrt{2}$. The question I want to tackle in this post is: what does that symbol mean? At first, I might have thought that $\sqrt{2} = 1.414\dots$. The problem with this line of reasoning is that it presupposes the number’s existence in order to tell us what it is. Another thing I might try is to argue that $\sqrt{2}$ is the limit of the following sequence of rational numbers:
$1, 14/10, 141/100, 1414/1000, \dots,$
but this explanation is fraught with the same circularity as my first attempt.
So how might I prove that there is some real number whose square is 2 without assuming such a number exists? To do this, I will appeal to the &lt;em>completeness&lt;/em> of the real numbers. That $\mathbb{R}$ is complete means that every subset $S$ of real numbers that is bounded above has a &lt;em>least&lt;/em> upper bound $\beta$. We write this compactly as $\beta = \sup S$. (We analogously define the greatest lower bound of a set $S$ if $S$ is bounded below — we denote said lower bound as $\inf S$.) Below we will require the technical definition of a least upper bound, so I will provide its two components here:&lt;/p>
&lt;ol>
&lt;li>For $x \in S$, $\beta \geq x$.&lt;/li>
&lt;li>If $\gamma$ is an upper bound of $S$, $\beta \leq \gamma$.&lt;/li>
&lt;/ol>
&lt;p>Look at the above for a second and make sure they jibe with your intuition about what a least upper bound is. It’ll make the rest of this much easier to understand.&lt;/p>
&lt;p>Now, given the completeness of $\mathbb{R}$, consider the set $S = \{t \in \mathbb{R} | t^2 &lt; 2\} \subseteq \mathbb{R}$.
It is clear that this set is bounded above by 2. By completeness, this tells us that there is some real number $\alpha =
\sup S$ that is $S$’s &lt;em>least&lt;/em> upper bound. To show that there is a real number that you can square to get 2, we just need
to show that $\alpha^2 = 2$. We will do this by ruling out the possibilities that $\alpha^2 > 2$ and that $\alpha^2 &lt; 2$.&lt;/p>
&lt;h2 id="ruling-things-out">Ruling things out&lt;/h2>
&lt;p>If $\alpha^2 &lt; 2$, then consider the number (we will discuss what $n$ to pick in a second):
&lt;/p>
$$(\alpha + \frac{1}{n})^2 = \alpha^2 + \frac{2\alpha}{n} + \frac{1}{n^2} &lt; \alpha^2 + \frac{2\alpha + 1}{n}.$$&lt;p>
What value of $n$ suits our needs here? Well, what we &lt;em>would like&lt;/em> to show here is that $(\alpha +\frac{1}{n})^2 &lt; 2$, thus contradicting property (1) of least upper bounds that $\alpha$ must satisfy. As such, we need to pick an $n$ such that $\frac{2\alpha + 1}{n}$ fits in the gap between $\alpha^2$ and 2. Put mathematically, we want
$\frac{2\alpha + 1}{n} &lt; 2 - \alpha^2$.&lt;/p>
&lt;p>Rearranging a bit, we see that for the above to be true, we need
$\frac{1}{n} &lt; \frac{2 - \alpha^2}{2\alpha + 1}$.
Such an $n$ surely exists (for more technically inclined readers, this follows from the Archimedean property of $\mathbb{R}$), so we now would like to show that $\alpha + \frac{1}{n} \in S$. This is now simple, because with the value of $n$ we’ve chosen, we have
&lt;/p>
$$\alpha^2 + \frac{2\alpha + 1}{n} &lt; \alpha^2 + (2 - \alpha^2) = 2.$$&lt;p>
Thus, if $\alpha^2 &lt; 2$, $\alpha + \frac{1}{n} > \alpha$ is a member of $S$. This directly contradicts our supposition that $\alpha$ is the least &lt;em>upper bound&lt;/em> of $S$ — we’ve found a number larger than $\alpha$ that is in $S$.
Now, if $\alpha^2 > 2$, we proceed by a similar argument. Consider the number
$(\alpha - \frac{1}{n})^2 = \alpha^2 - \frac{2\alpha}{n} + \frac{1}{n^2} > \alpha^2 - \frac{2\alpha}{n}.$
This time, we want to show that $\alpha - \frac{1}{n}$ is an upper bound for $S$, thus contradicting the supposition that $\alpha$ is the &lt;em>least&lt;/em> upper bound of $S$. To this end, again, choose $n$ so that $\frac{2\alpha}{n}$ fits in the gap between $\alpha^2$ and 2. Mathematically, we want
$\frac{2\alpha}{n} &lt; \alpha^2 - 2$,
so we choose $n$ so that
&lt;/p>
$$\frac{1}{n} &lt; \frac{\alpha^2 - 2}{2\alpha}.$$&lt;p>
By doing this, we get that
&lt;/p>
$$(\alpha - \frac{1}{n})^2 > \alpha^2 - \frac{2\alpha}{n} > \alpha^2 - (\alpha^2 - 2) = 2.$$&lt;p>
Since every element of $S$ squares to less than 2, this means that $\alpha - \frac{1}{n}$ is an upper bound of $S$ that is smaller than $\alpha$, contradicting the supposition that $\alpha$ was $S$’s &lt;em>least&lt;/em> upper bound!
Thus, $\alpha^2 = 2$, so we can rigorously associate the symbol $\sqrt{2}$ with $\alpha$.&lt;/p>
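&lt;p>To underline the existence-vs-construction theme, here is a short Python sketch (entirely separate from the proof, with hypothetical helper names) of the &lt;em>constructive&lt;/em> side: bisection narrows down $\sup S$ to any desired accuracy, even though the completeness argument above never needed a single decimal digit of it.&lt;/p>

```python
# A constructive counterpart to the existence proof: bisection closes in on
# sup S, where S is the set of reals whose square is below 2.

def bisect_sqrt2(iterations=60):
    """Approximate the least upper bound of S by repeatedly halving an interval."""
    lo, hi = 0.0, 2.0              # 0 is in S, 2 is an upper bound of S
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid * mid > 2:          # mid is an upper bound of S: move hi down
            hi = mid
        else:                      # mid is in S: move lo up
            lo = mid
    return hi

print(bisect_sqrt2())              # approaches 1.41421356...
```

&lt;p>Note that this procedure only ever produces approximations; it is the proof above that guarantees the number being approximated actually exists.&lt;/p>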
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>I guess what I was aiming to show here is that math gives us tools to find answers to questions we don’t initially see yet are nonetheless critical to the logical soundness of what we hope to build (and sometimes have already built — yikes!) on top of them. These — often fundamental — questions are sometimes hidden behind veils of apparent obviousness. The task of securing air-tight foundations upon which mathematicians do their work is a subtle game, and I thought this was an accessible example of some of that subtlety.&lt;/p></description></item><item><title>The Basel problem</title><link>https://www.jgindi.me/posts/2017-05-26-euler-basel/</link><pubDate>Fri, 26 May 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-05-26-euler-basel/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>In this post, I want to talk about what the Basel Problem is and how Euler solved it.
Even today, Euler remains one of the most accomplished mathematicians who ever lived.
His work impacted and created a multitude of fields across mathematics: number
theory, graph theory and topology, real and complex analysis, and parts of physics.
His solution to the Basel Problem, which we will discuss below, catapulted him to
fame in 1734.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Before we attack the Basel Problem itself, I want to set the stage. Students who have
taken an introductory course in calculus are familiar with the harmonic series:
&lt;/p>
$$\sum_{n=1}^\infty \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} +
\dots.$$&lt;p>
This series is the first example students are usually given of a series that diverges
even when its individual terms tend to 0. The fact that the harmonic series diverges
was originally proved by Oresme in the 14th century, but his work was lost. Roughly
three centuries later, in the late 17th century, the Bernoulli brothers re-proved the result.
Their success sparked their interest in the convergence/divergence of other infinite
series, one of which happened to be a natural extension of the harmonic series:
&lt;/p>
$$\sum_{n=1}^\infty \frac{1}{n^2} = 1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} +
\dots.$$&lt;p>The Basel Problem was to find the sum of this series. As a note before we see how
Euler did it, figuring out &lt;em>whether&lt;/em> an infinite sum converges is typically a &lt;em>much&lt;/em>
easier problem than &lt;em>computing&lt;/em> the sum. It’s easy to show that $\sum \frac{1}{n^2}$
converges using any one of a number of simple convergence tests (e.g. comparison,
integral), but finding the actual sum is a different matter entirely. This brings us
to our main result.&lt;/p>
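&lt;p>A quick numerical experiment (an illustration, not part of any argument in this post) makes the contrast vivid: the partial sums of the harmonic series drift upward forever, while those of $\sum \frac{1}{n^2}$ visibly settle down without revealing any closed form for their limit.&lt;/p>

```python
# Compare partial sums of the harmonic series and of the Basel series.

def partial_sums(terms):
    harmonic, basel = 0.0, 0.0
    for n in range(1, terms + 1):
        harmonic += 1 / n          # harmonic series: diverges
        basel += 1 / (n * n)       # Basel series: converges
    return harmonic, basel

for terms in (10, 1000, 100000):
    h, b = partial_sums(terms)
    print(terms, round(h, 4), round(b, 6))
```

&lt;p>The harmonic column keeps growing (logarithmically), while the other crawls toward $1.6449\dots$, a number you would be hard pressed to recognize as $\frac{\pi^2}{6}$ by sight.&lt;/p>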
&lt;h2 id="eulers-argument">Euler&amp;rsquo;s Argument&lt;/h2>
&lt;p>What we want to prove can be stated quite succinctly.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Theorem:&lt;/strong> $\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}$.&lt;/p>&lt;/blockquote>
&lt;p>Euler started off by considering the infinite polynomial
&lt;/p>
$$f(x) = 1 - \frac{x^2}{3!} + \frac{x^4}{5!} - \dots + \frac{(-1)^k x^{2k}}{(2k +
1)!} + \dots.$$&lt;p>
For $x \neq 0$,
&lt;/p>
$$f(x) = \frac{xf(x)}{x} = \frac{x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots +
\frac{(-1)^k x^{2k + 1}}{(2k + 1)!} + \dots}{x}.$$&lt;p>
Note that the numerator above is exactly the Taylor expansion of $\sin(x)$, so
we can (for $x \neq 0$) write
&lt;/p>
$$f(x) = \frac{\sin(x)}{x}.$$&lt;p>
Provided that we can find the roots of $f(x)$, we can factor it. In our case, $f(x) =
0$ when $\sin(x) = 0$ (and $x \neq 0$), so the infinitely many roots of $f$ are $k\pi$ for all nonzero
integers $k$. We can thus factor $f$ as:
&lt;/p>
$$f(x) = (1 - \frac{x}{\pi})(1 + \frac{x}{\pi})(1 - \frac{x}{2\pi})(1 +
\frac{x}{2\pi})\dots.$$&lt;p>(This factorization comes from a theorem in algebra that states that for a polynomial
$p(x)$, if the roots of $p$ are $a_1, a_2,\dots, a_n$ and $p(0) = 1$, then you can
factor $p(x) = (1 - x/a_1)(1 - x/a_2)\dots(1-x/a_n)$.)&lt;/p>
&lt;p>Next, we observe that each pair of factors of the form $(1 - \frac{x}{k\pi})(1 +
\frac{x}{k\pi})$ can be combined into $1 - \frac{x^2}{k^2\pi^2}$, so f now looks
like
&lt;/p>
$$f(x) = (1 - \frac{x^2}{\pi^2})(1 - \frac{x^2}{2^2\pi^2})(1 - \frac{x^2}{3^2\pi^2})\dots$$&lt;p>
Okay… this is good, because $f$ smells of both $\pi$ and the squares of natural numbers…&lt;/p>
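&lt;p>Before moving on, we can at least sanity-check Euler’s daring factorization numerically. The following Python sketch (an illustration, under the assumption that truncating the infinite product is harmless) compares a finite piece of the product against $\frac{\sin(x)}{x}$:&lt;/p>

```python
import math

# Multiply the first `factors` terms of Euler's product and compare
# the result with sin(x)/x.

def truncated_product(x, factors=20000):
    prod = 1.0
    for k in range(1, factors + 1):
        prod *= 1 - x * x / (k * k * math.pi * math.pi)
    return prod

x = 1.3
print(truncated_product(x))    # agrees with sin(x)/x to several digits
print(math.sin(x) / x)
```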
&lt;p>To force these pieces into place, Euler then multiplied out our factorization and collected the coefficients of like powers of $x$. In particular, as it pertains to the our problem, he only cared about the coefficient on the $x^2$ term of $f$. The only $x^2$ terms in $f$ are those that are produced by multiplying an $x^2$ term with a 1. Adding up these terms, we get the following sum:
&lt;/p>
$$-\frac{x^2}{\pi^2} - \frac{x^2}{4\pi^2} - \frac{x^2}{9\pi^2} - \dots = x^2\biggl[-\frac{1}{\pi^2}\biggl(1 + \frac{1}{4} + \frac{1}{9} +\dots\biggr)\biggr],$$&lt;p>
so the coefficient is
&lt;/p>
$$-\frac{1}{\pi^2}\biggl(1 + \frac{1}{4} + \frac{1}{9} +\dots\biggr).$$&lt;p>From our original representation of $f$ (before we multiplied it by $\frac{x}{x}$),
we know what the coefficient should be, namely $-\frac{1}{3!}$. Equating the
coefficient we found with what we know it should be, we have
&lt;/p>
$$-\frac{1}{3!} = -\frac{1}{\pi^2}\biggl(1 + \frac{1}{4} + \frac{1}{9} +\dots\biggr),$$&lt;p>
which, rearranged a bit, solves the Basel problem because it means that
&lt;/p>
$$\frac{\pi^2}{6} = 1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \dots.$$&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Before I conclude, I want to clarify that Euler’s argument was not entirely rigorous.
Without justification, he extended results about finite polynomials to infinite
series and products.&lt;/p>
&lt;p>Nonetheless, his solution to the Basel Problem is a great example of the ingenuity
with which Euler attacked many of the open mathematical conundrums of his day.
Despite its reputation as rote and formulaic, mathematics requires a wild
imagination. Euler’s was one of the wildest, and mathematicians who inherited his
legacy could not be more thankful.&lt;/p></description></item><item><title>Hilbert's hotel</title><link>https://www.jgindi.me/posts/2017-04-05-hotel/</link><pubDate>Wed, 05 Apr 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-04-05-hotel/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>It’s often the case that when people try to reason about infinities, they get lost in
forests of paradoxes. More precisely, they stop being able to intuitively make their
way around the mathematical landscape. You see, infinity isn’t something we deal with
in our daily lives. You could probably argue that we are biologically predisposed to
have trouble with it; &lt;a href="https://scottaaronson.com">Scott Aaronson&lt;/a>
wrote a &lt;a href="https://www.scottaaronson.com/writings/bignumbers.html">great article&lt;/a> that speaks to this.&lt;/p>
&lt;p>A few months ago, while I was TAing a class on ideas in mathematics, I was asked to
give one of the lectures. I decided to make it about the basics of infinity and in
it, I walked through things like what it means for infinite sets to have the same
size, Cantor’s diagonalization argument and some other related items of interest
(right out of some of my earlier blog posts). I then talked about one of the famous
examples of infinity’s mindbending amazingness: Hilbert’s hotel. The students in the
lecture seemed into it, so I told myself after that lecture that I would try to write
something about it here… this is that something.&lt;/p>
&lt;h2 id="a-few-people">A few people&lt;/h2>
&lt;p>I run a strangely constructed hotel. It has one infinitely long hallway with
countably infinite rooms, numbered 1, 2, 3, 4 and on and on and on. There’s just one
problem, though: all of the rooms are occupied and I have a guest at the front desk
who wants a room. How might I accommodate him?&lt;/p>
&lt;p>(Think about this for a minute… what would you do?)&lt;/p>
&lt;p>After much thought and consulting some mathematically oriented consultants, I decided
to have everyone move over by 1 room. That is, the person in room 1 moves to room 2,
the person in room 2 moves to room 3 and so on. Mathematically, the person in room
$n$ moves to room $n + 1$. As is hopefully clear, by doing this, room 1 is now open
and, after I get some of the staff to clean it, ready for my new guest.
Extending this method further, we can actually accommodate any finite number $k$ of
guests by just having everyone move over $k$ rooms. For example, if $k = 78$, we
would put the guest currently in room 1 in room 79, the person in room 2 into room 80
etc. This would free up the first $k$ rooms and enable us to accommodate the $k$ new
guests.&lt;/p>
&lt;h2 id="a-caravan">A caravan&lt;/h2>
&lt;p>Ok, that wasn’t too bad, and the solution seems plausible enough. A few months later,
however, I encountered, shall we say, a bigger problem. A countably infinite number
of people came to the desk and said they all wanted rooms. Thinking back to how I
accommodated a finite number of guests, I quickly realized that this wasn’t going to
work. I can’t exactly ask a guest in room 1 to move to room $\infty$ and the person
in room 2 to move to room $\infty + 1$; they’d never stop walking down that hallway!
What kind of hospitality would that be?! The only acceptable way to accommodate new
guests is one that assigns each guest that currently has a room a specific new room…
can I do this for an infinite number of guests?&lt;/p>
&lt;p>Turns out I can. After consulting my friends again, we decided that the easiest way
to accomplish this would be to free up all of the odd numbered rooms. Formally, we
would move the guest in room $n$ to room $2n$. Note that no two guests get
assigned the same room; the guest in room 7 is the only guest who ends up in room 14.
Observe also that after the move, the only occupied rooms are the even numbered rooms
because any room that we moved someone to with our rule is even numbered. Thus all of
the odd rooms are open and we can put our guests into the odd numbered rooms.&lt;/p>
&lt;p>Why did this work mathematically? I’m going to assume some knowledge of the
terminology that follows, but the reason is that the function $f : \mathbb{N} \to
\mathbb{E}$ (where $\mathbb{E}$ is the set of even numbers) given by $f(n) = 2n$ (our
rule) is actually a bijection (or one-to-one correspondence). This means we can
match each natural number with exactly one even number and that every even number
gets hit (think about this for a minute; try to convince yourself of this). In some
sense, the fact that $f$ is bijective tells us that we can fit $\mathbb{N}$ into
$\mathbb{E}$ (which is actually itself a subset of $\mathbb{N}$). Doing this frees
up any rooms whose labels aren’t in $\mathbb{E}$, namely the odd rooms. What we’ve
actually done here is shown that in a mathematically precise way, there are as many
even numbers as there are whole numbers… cool, no?&lt;/p>
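&lt;p>The whole move can be phrased as a two-line program. Here is a small Python sketch (with hypothetical helper names, purely for illustration) that checks, on a finite prefix of the hotel, the two properties we relied on: the rule never collides, and it fills exactly the even rooms.&lt;/p>

```python
# Guest in room n moves to room 2n.

def new_room(n):
    return 2 * n

assignments = [new_room(n) for n in range(1, 1001)]

assert len(set(assignments)) == len(assignments)    # injective: no two guests collide
assert set(assignments) == set(range(2, 2001, 2))   # exactly the even rooms get filled
print("all odd rooms up to 2000 are now free")
```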
&lt;h2 id="caravans-and-caravans">Caravans and caravans&lt;/h2>
&lt;p>At this point, I thought I was good to go. Now that I know how to accommodate an
infinity of guests, what more could I possibly need to know? Turns out, my biggest
challenge of them all was yet to come. A few years after solving the infinite guest
problem, a countably infinite number of caravans each carrying a countably infinite
number of guests showed up and said that all of them wanted rooms. This seemed rather
daunting… how was I going to tackle this one?&lt;/p>
&lt;p>Well, I said, when the going gets tough, the tough get going. So, naturally, I called
my friends again ;) This time, they suggested the following scheme:&lt;/p>
&lt;ol>
&lt;li>Free up the odd rooms (we know how to do this already).&lt;/li>
&lt;li>Label each caravan with the odd primes in ascending order (there are infinitely
many of these, so we’ll have enough for all caravans).&lt;/li>
&lt;li>For each caravan, label the people in it 1, 2, 3, 4 etc.&lt;/li>
&lt;li>Put person $n$ from the caravan labeled $p$ into room $p^n$.&lt;/li>
&lt;/ol>
&lt;p>Ok, seems simple enough. But does it work?&lt;/p>
&lt;p>To show that it does observe the following few simple facts:&lt;/p>
&lt;ul>
&lt;li>If $p$ is an odd prime, then $p^n$ will also be odd, so I will never assign a
guest to an even room.&lt;/li>
&lt;li>A power of $p_1$ will never contain a factor of $p_2$ (e.g. $5^3$ can’t contain
any prime factors except for $5$); this means there is no way that I’ll
accidentally assign two people from different caravans to the same room.&lt;/li>
&lt;li>$p^i \neq p^j$ for $i \neq j$, so we see that with this rule, I won’t assign any
pair of people from the same caravan to the same room.&lt;/li>
&lt;/ul>
&lt;p>Given these three facts, we see that this rule successfully gives all of our guests
rooms. It even leaves a bunch of rooms unoccupied! Can you see which ones?&lt;/p>
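&lt;p>For the skeptical, the three facts are easy to machine-check on a finite slice of the hotel. A small Python sketch (illustrative only; the helper names are my own):&lt;/p>

```python
# Person n from the caravan labeled with odd prime p goes to room p**n.

def is_prime(m):
    return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

odd_primes = [p for p in range(3, 40) if is_prime(p)]         # first few caravan labels

rooms = [p ** n for p in odd_primes for n in range(1, 11)]    # ten people per caravan

assert all(room % 2 == 1 for room in rooms)    # only odd rooms are ever used
assert len(set(rooms)) == len(rooms)           # no two people share a room
print(len(rooms), "guests placed without conflict")
```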
&lt;p>(As a technical aside, we just fit a set the size of $\mathbb{N}
\times \mathbb{N}$ — our caravans — inside a set the size of $\mathbb{N}$ — the odd
numbers — without any collisions. Without going into too much detail, by
noting that $\mathbb{Q}$, the set of rational numbers, is the same size as $\mathbb{N} \times
\mathbb{N}$, the last leg of our journey is essentially a proof of the somewhat
surprising fact that $|\mathbb{Q}| = |\mathbb{N}|$.)&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>It’s been a pleasure taking you through some of my experiences as a hotel manager.
As of late, business is booming and I couldn’t be happier. You can bring as many
friends as you like and I can almost guarantee we’ll be able to make room for all of
you. If you bring uncountably many friends, though, I’ll have to send you to
Cantor’s hotel down the street.&lt;/p></description></item><item><title>The derivative via linear algebra</title><link>https://www.jgindi.me/posts/2017-03-15-linalg-deriv/</link><pubDate>Wed, 15 Mar 2017 00:00:00 +0000</pubDate><guid>https://www.jgindi.me/posts/2017-03-15-linalg-deriv/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As a math major in college, I had, for a long time, been under the impression
that calculus and algebra were totally separate parts of math. The types of problems
you thought about in one of them were totally disjoint from the types of problems you
tackled in the other. Continuous vs. Discrete. Algebraic this vs. Analytic that.
As I was watching (a wonderful) &lt;a href="https://www.youtube.com/watch?v=fNk_zzaMoSs&amp;amp;list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab">video series&lt;/a> on linear algebra
by 3blue1brown, I came across the following really cool connection between
calculus and algebra that
was simple, elegant and clever. But, more importantly, it spectacularly illustrates
the connections that one finds between seemingly separate parts of math.
Let’s take a look at finite polynomials.&lt;/p>
&lt;h2 id="polynomials">Polynomials&lt;/h2>
&lt;p>Polynomials are mathematical objects of the
form $p(x) = a_0 + a_1x + a_2x^2 + \dots + a_nx^n$, where the $a_i$ are scalars drawn
from a field $F$ and $n$ is an arbitrary natural number. It’s easy to check that
polynomials actually make up a vector space over $F$. Formally, this means that:&lt;/p>
&lt;ol>
&lt;li>Polynomials make up a commutative group under addition.
&lt;ol>
&lt;li>There’s an identity element ($p(x) = 0$ is the identity).&lt;/li>
&lt;li>Every element has an inverse (the inverse of $p(x)$ is $-p(x)$).&lt;/li>
&lt;li>Addition is associative ($p(x) +(q(x) + r(x)) = (p(x) + q(x)) + r(x)$)&lt;/li>
&lt;li>Adding two polynomials always produces another polynomial (this property is
called closure).&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>If you multiply a polynomial by a scalar, the result is yet another polynomial.&lt;/li>
&lt;li>Scalar multiplication distributes over polynomial addition (if $p$ and $q$ are
polynomials and $c$ is a scalar, the three must satisfy $c(p + q) = cp + cq$).&lt;/li>
&lt;li>Scalar multiplication distributes over scalar addition (if $c_1$ and $c_2$ are
scalars and $p$ is a polynomial, they must satisfy $(c_1 + c_2)p = c_1p + c_2p$).&lt;/li>
&lt;li>The field’s multiplicative identity must act as an identity on polynomials
($1 \cdot p(x) = p(x)$).&lt;/li>
&lt;li>If $c_1$ and $c_2$ are scalars and $p$ is a polynomial, then $c_1(c_2p) =
(c_1c_2)p$.&lt;/li>
&lt;/ol>
&lt;p>I’m going to skip verifying these, but if you think about them, they’re mostly (if
not all) sort of intuitive. For the rest of the post, we are just going to assume
that polynomials make up a vector space.&lt;/p>
&lt;h2 id="calculus-detour">Calculus detour&lt;/h2>
&lt;p>Let’s jump over to calculus for a minute. Do you remember how we differentiate a
polynomial? For example, if $p(x) = 3x^2 + x + 7$, what is $D(p(x))$?
If we recall our first calculus course, we remember that we were told that we could
differentiate each of $3x^2$, $x$ and $7$ separately and then add the results
together. Furthermore, we have two differentiation rules that will help us
differentiate a single term:&lt;/p>
&lt;ol>
&lt;li>$D(x^n) = nx^{n-1}$.&lt;/li>
&lt;li>$D(cf(x)) = cD(f(x))$ (you can pull out constants).&lt;/li>
&lt;/ol>
&lt;p>With these rules in hand, we see that the derivative of $3x^2$ is $3 \cdot 2x = 6x$,
the derivative of $x$ is 1 and the derivative of 7 (or any other constant, for that
matter) is 0. Adding these together, we conclude that $D(p(x)) = 6x + 1$. Okay,
now reread the calculus we just thought through and keep it in mind; we have to jump
back to linear algebra for a second.&lt;/p>
&lt;h2 id="differentiation-is-a-linear-map">Differentiation is a linear map&lt;/h2>
&lt;p>If I have two vector spaces $V$ and $W$ over a field $F$, then a map $T:V \to W$ is
said to be linear if:&lt;/p>
&lt;ol>
&lt;li>$T(u + v) = Tu + Tv$.&lt;/li>
&lt;li>$T(cu) = cTu$ (for $c \in F$).&lt;/li>
&lt;/ol>
&lt;p>The first rule says applying a linear transformation to a sum of vectors should
produce the same result as if you applied the transformation to each summand and then
added the results in the target space. This looks kind of familiar, doesn’t it? Above, when
we computed $D(p(x))$, we took $p(x)$ apart, applied $D$ to each part, and then put
the results back together… In other words, we said that
&lt;/p>
$$D(p(x)) = D(3x^2) + D(x) + D(7).$$&lt;p>It’s easy to see that this rule, whereby we are allowed to decompose things, work on
them, and put them back together, generally applies to the differentiation of any
polynomial, so we’ve established that the polynomial differentiation operator $D$
satisfies the first property of linear maps!&lt;/p>
&lt;p>Furthermore, if we look at the second differentiation rule that helped us up above,
it is exactly the second property of linear transformations! (Just replace $T$ with
$D$ and $u$ with some polynomial $p(x)$.)&lt;/p>
&lt;p>We thus see that the operator $D$, which takes the derivative of a polynomial, is
linear!&lt;/p>
&lt;p>To sum up what we’ve said so far:&lt;/p>
&lt;ul>
&lt;li>The space of polynomials is a vector space (we will henceforth call $P$).&lt;/li>
&lt;li>The differentiation operator, $D$, is a linear transformation from $P$ to itself
(because differentiating a polynomial always gives another polynomial).&lt;/li>
&lt;/ul>
&lt;p>Thus, once we produce a convenient basis for $P$, we can actually write down a
matrix that will do differentiation of polynomials for us! But what basis should we
use?&lt;/p>
&lt;p>Because polynomials in $P$ can have arbitrarily large degree, our basis will
actually be infinite. The basis we choose is actually inherent in the general
structure of polynomials. Can you see what it might be?
Because every polynomial is a finite linear combination of elements of the infinite list
$\{1, x, x^2, \dots\}$ (e.g. $3x^2 + 4x + 3$ can be seen as $3 \cdot 1 + 4 \cdot x +
3 \cdot x^2 + 0 \cdot x^3 + 0 \cdot x^4 +\dots$), we will call this set our
basis (verify span and linear independence!) and now use it to write down a(n
infinite) matrix corresponding to $D$.&lt;/p>
&lt;p>Note that the $i$th column of a matrix describes what the transformation does to the
$i$th basis vector of our space. So, in order to write down the first column of
$D$’s matrix, we need to know what $D(1)$ is written as in terms of $P$’s basis
vectors. Well, if $D(1) = 0 = 0 \cdot 1 + 0 \cdot x + \dots$, then the first column
of our matrix must be
&lt;/p>
$$\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}.$$&lt;p>To determine the next column, we look at what $D$ does to our second basis vector,
$x$. $D(x) = 1 = 1 \cdot 1 + 0 \cdot x + \dots$, so the second column of our matrix
would look like
&lt;/p>
$$\begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \end{bmatrix}.$$&lt;p>
The last basis vector we need to look at before we can intuit the rest of the
columns is $x^2$. $D(x^2) = 2x = 0 \cdot 1 + 2 \cdot x + 0 \cdot x^2 + \dots$, so
the third column is
&lt;/p>
$$\begin{bmatrix} 0 \\ 2 \\ 0 \\ \vdots \end{bmatrix}.$$&lt;p>
You could probably guess the next column, and the one after that, and, most
probably, all of the ones after that… we finally have this matrix for $D$ (note that
it’s infinite):
&lt;/p>
$$A = \begin{bmatrix} 0 &amp; 1 &amp; 0 &amp; 0 &amp; \dots \\ 0 &amp; 0 &amp; 2 &amp; 0 &amp; \dots \\ 0 &amp; 0 &amp; 0 &amp; 3
&amp; \dots \\ \vdots &amp; \vdots &amp; \vdots &amp; \vdots &amp; \ddots\end{bmatrix}.$$&lt;p>
If you represent a polynomial as a(n infinitely long) vector of its coefficients,
then you can actually do differentiation with this matrix. For example, if your
polynomial was $p(x) = 4x^3 + 5x^2 + 29x + 9$, you would perform
&lt;/p>
$$A\begin{bmatrix} 9 \\ 29 \\ 5 \\ 4 \\ 0 \\ \vdots \end{bmatrix} = \begin{bmatrix}
29 \\ 10 \\ 12 \\ 0 \\ 0 \\ \vdots \end{bmatrix},$$&lt;p>
i.e. your derivative is $12x^2 + 10x + 29$, which, using the rules you learned in
calculus class, is demonstrably correct.&lt;/p>
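&lt;p>If you’d like to see the matrix in action, here is a short, self-contained Python sketch of a truncated version of $A$ (cut off at degree 4 purely for display), applied to the example polynomial above:&lt;/p>

```python
# Build the differentiation matrix for polynomials of degree at most 4.
degree = 4
A = [[0] * (degree + 1) for _ in range(degree + 1)]
for i in range(degree):
    A[i][i + 1] = i + 1            # column i+1 sends x^(i+1) to (i+1) * x^i

def apply(matrix, vec):
    # Ordinary matrix-vector multiplication over plain Python lists.
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]

p = [9, 29, 5, 4, 0]               # 4x^3 + 5x^2 + 29x + 9, constant term first
print(apply(A, p))                 # [29, 10, 12, 0, 0], i.e. 12x^2 + 10x + 29
```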
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>To sum up, we’ve reconceived the space of polynomials as a vector space and used
notions from both linear algebra and calculus to come up with a pretty nice looking
matrix that doesn’t intuitively look like differentiation, but that somehow
perfectly describes it when you look at it through the right lens. There are
connections like these all over mathematics, you just have to know where to look.&lt;/p></description></item></channel></rss>