
Tale of two distributions

Introduction

A theme of the many posts I’ve written over the last few years is that there are deep and beautiful connections we find between apparently different areas by appealing to a little bit of formalism and some finely-tuned intuition. In this case, the two objects I will connect do not look too different from one another. In fact, they look eerily similar; it just isn’t immediately clear how to connect the dots.

(For this post, I’m going to assume a basic understanding of random variables and probability distributions (more or less just what they are). I’ll also assume basic familiarity with permutations and combinations — just the definitions — and some basic facility with limits. It’s more than I like to assume and I’ll do my best to explain things intuitively as I go along, but if you stick with me, I think it will be well worth your while. This stuff is very, very cool.)

Binomial distribution

Suppose I’m flipping a fair coin. If $X$ is a random variable that takes the value 1 on heads and 0 on tails, we might encode the likelihoods of different outcomes of this experiment as $P(X = 1) = P(X = 0) = 1/2$. More generally, if I have some experiment with two possible outcomes, one of which occurs with probability $p$ (we call this event a success) and the other of which occurs with probability $1 - p$ (we call this a failure), a random variable representing the outcome of the experiment is said to be a “Bernoulli $p$” (henceforth $\text{Bern}(p)$) random variable. The random variable $X$ from the example I opened with is $\text{Bern}(1/2)$, but the general story that describes what sorts of sample spaces might hint at this underlying distribution is the same no matter what $p$ is: I have some experiment with two possible outcomes, one that occurs with probability $p$ and another that occurs with probability $1 - p$.
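(If you like seeing these things in code, here is a minimal sketch in Python of sampling a $\text{Bern}(p)$ random variable; the `bernoulli` helper is just a name I’m making up for illustration.)

```python
import random

def bernoulli(p):
    """Return 1 (success) with probability p, else 0 (failure)."""
    return 1 if random.random() < p else 0

# Flip a fair coin (p = 1/2) many times; the sample mean should hover near 1/2.
samples = [bernoulli(0.5) for _ in range(100_000)]
print(sum(samples) / len(samples))
```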

Let’s make things more interesting. Suppose I repeat this Bernoulli (read: two-outcome) experiment $n$ (read: a bunch of) times, one after the next, where each trial is independent of all the others, and ask you: What is the probability that $k$ (read: some number no larger than $n$) of the experiments result in a success?

(Think about this and try to work out the answer before reading on.)

As Alon once put it, your internal monologue should go something like: “Well… Because my trials are independent of one another, I’m probably dealing with a bunch of probabilities multiplied together. The probability of one success is $p$, so the probability of $k$ successes has got to be $p^k$. The remaining $n - k$ events must be failures, so by similar logic, the probability that the rest of the experiments fail is $(1-p)^{n-k}$. Multiplying these together, I’m pretty sure the answer is $p^k(1-p)^{n-k}$.”

Almost! There’s just one piece you’d be missing. As an example, let’s think about the different ways to get two heads in three coin flips. You can either get HHT, HTH or THH. Each of these has probability $(1/2)^2(1/2)$, but there are three ways to achieve an outcome with that probability, so the probability of getting two heads actually ends up being $3(1/2)^2(1/2)$.

More generally, the probability of any one ordering of experiment outcomes with $k$ successes is indeed $p^k(1-p)^{n-k}$, but to be complete, you need to count up all of the different ways to arrive at $k$ successes in a sequence of $n$ trials. The number of configurations in which $k$ of the $n$ trials are successes is ${n \choose k}$, so the probability of seeing $k$ successes in $n$ independent Bernoulli trials is ${n \choose k}p^k(1-p)^{n-k}$.
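As a sanity check, here is a small sketch (Python, standard library only; `binomial_pmf` is a name I’ve chosen for illustration) that computes ${n \choose k}p^k(1-p)^{n-k}$ and compares it against a brute-force simulation of the three-coin example above.

```python
import random
from math import comb

def binomial_pmf(n, k, p):
    """P(exactly k successes in n independent Bern(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two heads in three fair flips: 3 * (1/2)^2 * (1/2) = 0.375.
print(binomial_pmf(3, 2, 0.5))

# Brute-force check: simulate many rounds of three flips and count the twos.
rounds = 100_000
hits = sum(1 for _ in range(rounds)
           if sum(random.random() < 0.5 for _ in range(3)) == 2)
print(hits / rounds)  # should land near 0.375
```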

The above story, wherein I’m trying to compute the probability of seeing a certain number of successes across a bunch of Bernoulli trials, describes what is called the binomial distribution. (An interesting note here: if you multiply out $(a + b)^n$, you’ll get $a^n + na^{n-1}b + {n \choose 2}a^{n-2}b^2 + \dots + b^n$. See if you can phrase what’s happening there in the language of permutations and combinations.) The binomial distribution is one of the most famous discrete distributions there is and it has a wide range of applications all over probability. Before explaining why I’ve led you here, we need to take a quick detour.

Poisson distribution

While the binomial distribution is just doing its thing, minding its own business, in a galaxy far, far away sits another distribution: the Poisson. The Poisson distribution is usually used to describe situations in which many trials occur and each trial has a very small probability of success. As an example, a Poisson random variable might represent the number of emails you received in the past hour. The likelihood that any one person emailed you during that hour is very low (low probability of success), but the number of people (trials) who could possibly have emailed you is very high.
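(To make the story concrete, here is a quick simulation sketch with made-up numbers: 50,000 potential senders, each emailing you with probability 0.00004 in a given hour. Both numbers are purely illustrative.)

```python
import random

n_senders = 50_000   # many trials
p_email = 0.00004    # tiny probability of "success" per sender

# Simulate a handful of hours and count the emails that arrive in each.
for _ in range(5):
    emails = sum(random.random() < p_email for _ in range(n_senders))
    print(emails)  # small counts, clustered around n_senders * p_email = 2
```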

Deriving the PMF (probability mass function — essentially a formula that allows you to calculate the probabilities of various events) for this is not as easy as it was for the binomial, but before we jump in and try to figure out what that might look like, it’s useful to note that our Poisson story bears some interesting similarity to the story we used to define the binomial. In both cases, a bunch of trials take place, each of which has some probability of success, and we want to know the probability of seeing a certain number of successes. To crystallize this fuzzy-seeming connection (and this is the critical point): what we are trying to capture with the Poisson story is essentially the same thing we captured when we derived the binomial, but specifically in the regime where $n$ is large and $p$ is small. Can we formalize this somehow?

The connection?

We can! The first thing we’re going to do is pin down the relationship between $n$ and $p$. Above, we saw that we need to express that $p$ is small and $n$ is large. One way to do that is to require that $np$ be constant (there are other ways, but this is the one that will be useful here); we’ll name that constant $\lambda$. Why does this help us? Well, if I pick a value of $\lambda$, say 1, before I start, then once I choose $n$, I’ve determined $p$. As an example, if $n = 100$, then because $np = 1$, $p$ must be 1/100. Furthermore, as $n$ gets larger, $p$ has to get smaller to keep $\lambda$ constant, so if we let $n$ tend to $\infty$, we are implicitly forcing $p$ to tend to 0. Being able to write $p$ in terms of $n$ is going to be very helpful in what’s to come.
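(Concretely, with $\lambda = 1$ fixed, choosing $n$ pins down $p$, and larger $n$ forces smaller $p$:)

```python
lam = 1.0  # fix lambda = n * p

for n in [10, 100, 1_000, 10_000]:
    p = lam / n
    print(n, p)  # p shrinks toward 0 as n grows: 0.1, 0.01, 0.001, 0.0001
```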

With the relationship between $n$ and $p$ specified more precisely, we will now see what happens to the binomial PMF as $n$ grows. That is, we want to compute

$$\lim_{n \to \infty} {n \choose k} p^k(1 - p)^{n - k}.$$

Because $p = \lambda / n$, we can rewrite all the $p$s in the above limit as $\frac{\lambda}{n}$s:

$$\lim_{n \to \infty} {n \choose k} (\frac{\lambda}{n})^k(1 - \frac{\lambda}{n})^{n - k}.$$

Simplifying a little bit, the limit can be rewritten as

$$\lim_{n \to \infty} \frac{n!}{k!(n-k)!} \frac{\lambda^k}{n^k}(1 - \frac{\lambda}{n})^{n - k} = \lim_{n \to \infty} \frac{n(n-1)(n-2)\dots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot (1 - \frac{\lambda}{n})^{n - k}.$$

We will look at each term of the product one at a time. Provided that their limits all exist, we can glue them together. For the leftmost multiplicand, noting that $\frac{n-m}{n}$ for constant $m$ tends to 1 as $n$ tends to $\infty$, we observe that we have $k$ numerator-denominator pairs that all tend to 1. Multiplied together, these show that the leftmost term tends to 1. We can pull $\frac{\lambda^k}{k!}$ out of the limit because $k$ is constant. Combined with our analysis of the first term, this means we still need to solve
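(If you’d like numerical reassurance about that leftmost term, here is a quick sketch; `falling_over_power` is just an illustrative name.)

```python
from math import prod

def falling_over_power(n, k):
    """n(n-1)...(n-k+1) / n^k, the leftmost factor above."""
    return prod(n - m for m in range(k)) / n**k

for n in [10, 100, 1_000, 10_000]:
    print(n, falling_over_power(n, k=3))  # creeps up toward 1 as n grows
```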

$$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^{n - k}.$$

Because $k$ is constant and $n$ is shooting off to $\infty$, $n-k$ behaves the same as $n$ here (more precisely, the difference is an extra factor of $(1 - \frac{\lambda}{n})^{-k}$, which tends to 1), so we can rewrite the limit as

$$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^n.$$

This limit is a thinly veiled version of a common exercise from standard first semester calculus courses. I challenge you to convince yourself that

$$\lim_{n\to \infty} (1 + \frac{a}{n})^{bn} = e^{ab}.$$

(Hint: Start by setting $y = \lim_{n\to \infty} (1 + \frac{a}{n})^{bn}$ and taking the natural log of both sides.)
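(If you’d rather see numbers first, here is a quick numerical check with arbitrarily chosen values of $a$ and $b$; it doesn’t replace the proof, but it should make the claim believable.)

```python
from math import exp

a, b = -2.0, 1.5  # arbitrary values, just for a spot check

for n in [10, 1_000, 100_000]:
    print(n, (1 + a / n) ** (b * n))

print("target e^(ab):", exp(a * b))
```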

In our case, $a = -\lambda$ and $b = 1$, so our last piece evaluates to $e^{-\lambda}$. Gluing our pieces together, we have the PMF of the Poisson distribution: if $X$ is a Poisson random variable, $P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$. (Check Wikipedia, I dare you.)
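To close the loop numerically, here is a sketch (Python standard library; the helper names are mine) that holds $\lambda = 1$ fixed and watches the binomial probability of $k = 2$ successes march toward the Poisson value as $n$ grows.

```python
from math import comb, exp, factorial

def binomial_pmf(n, k, p):
    """P(exactly k successes in n independent Bern(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """The Poisson PMF derived above: e^{-lambda} lambda^k / k!."""
    return exp(-lam) * lam**k / factorial(k)

lam, k = 1.0, 2
print("Poisson limit:", poisson_pmf(k, lam))  # e^{-1}/2, about 0.1839

for n in [10, 100, 10_000]:
    print(n, binomial_pmf(n, k, lam / n))  # closes in on the Poisson value
```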

Conclusion

When I saw the above, I thought it was a really nice use of mathematical formalism to ground a connection we first noticed more intuitively. I find that it is often in giving firm footing to these sorts of beautiful connections that math shines brightest.

(This post was inspired by Joe Blitzstein’s Stat110 course on YouTube. Your course is awesome and I’ve learned a ton. Thank you.)

This post is licensed under CC BY 4.0 by the author.