
The weak law of large numbers

Introduction

When I say that Alice is better than Bob at chess, I’m effectively asserting that, barring careless mistakes, Alice should always beat Bob. This notion of Alice “being better than” Bob is easy to wrap one’s head around when the game in question doesn’t have to contend with randomness or uncertainty. What does it mean, though, to say that Alice is better than Bob at a game like backgammon, where dice (a source of randomness) are involved? The rest of this post aims to provide some mathematical machinery to answer such a question.

Whenever one talks about expecting some “eventual” outcome of a game or experiment, he or she is actually invoking a fundamental statistical law: the Law of Large Numbers (henceforth LLN). The LLN is one of a group of fundamental results in probability theory known as limit theorems, which are statements about how certain aspects of random variables stabilize as you make more and more observations of them. In plain English (we’ll get to its technical formulation a bit later), the LLN (and we’ll see that there are actually two variants) says that if you have an experiment whose average outcome is a number $m$, then as you try the experiment more and more times, the average value of your collection of outcomes will tend to $m$.
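If you’d like to see this in action before going any further, here’s a minimal simulation sketch (my own illustration, with a fair six-sided die as the “experiment,” so the average outcome is $m = 3.5$): we roll the die over and over and watch the running average of the rolls settle toward 3.5.

```python
# A minimal sketch: roll a fair six-sided die (true average value m = 3.5)
# and watch the running average of the outcomes drift toward m as the
# number of rolls grows.
import random

random.seed(0)
total = 0.0
for n in range(1, 100_001):
    total += random.randint(1, 6)  # one new observation
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {n:>6} rolls, running average = {total / n:.4f}")
# The printed averages should settle near 3.5 as n increases.
```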

Examples

To get a feel for what’s going on here, let’s look at a few examples that demonstrate the LLN’s ubiquity and importance.

  • When you play blackjack against the house, its ability to make money hinges on a critical assumption: even if you play a probabilistically optimal strategy (and don’t cheat), the chance that you win a hand is less than 50%. The house edge might be very, very small, but as long as it has some nonzero edge, the house stands to make money in the long run because it is playing lots and lots of hands. This assumption relies on the LLN. (There’s a small simulation of this idea right after the list.)
  • Let’s imagine that a basketball player is in a shooting slump. He regularly shoots around 40% from outside the three point line, but of late he’s only managed to connect on 20% of his attempts. Encouragement offered to such a player typically looks like “Don’t worry, just keep shooting, your shot will come back!” A coach who lifts his player’s spirits this way is also relying on the LLN.
  • To return to our backgammon example, when we say that Alice is better than Bob, what we’re saying is that in any individual game it’s possible that Bob beats Alice, but if Alice were to play 100 or 1,000 or 1,000,000 games against Bob, she would end up winning a majority of them. The more games they played, the more obvious her advantage would become.
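Here’s the small simulation promised in the blackjack example. It’s only a toy sketch with a made-up number, not real blackjack: I’m assuming the player wins each hand with probability 0.49, so the house has a 2% edge per hand, and tallying the house’s profit over more and more hands.

```python
# Toy model of a game with a small house edge (the 49% player win
# probability is an assumption for illustration, not a real blackjack figure).
# Each hand the player wins one unit with probability 0.49; otherwise the
# house wins one unit.
import random

random.seed(1)
p_player_wins = 0.49  # assumed player win probability
for hands in (100, 10_000, 1_000_000):
    house_profit = sum(-1 if random.random() < p_player_wins else 1
                       for _ in range(hands))
    print(f"{hands:>9} hands: house profit = {house_profit:+d} units")
# Over 100 hands the house can easily come out behind; over a million hands
# its tiny per-hand edge almost always turns into a large profit.
```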

Formal statement

As I’ve noted in many other posts, it is one thing to have a good idea, another to formalize it mathematically. The concept underlying the LLN is one that most of us intuitively grasp without understanding statistics. But in order to prove it and then use it as a building block to understand other, more subtle results, we need to be able to state it formally.

Before we do, I want to note that there are actually two well known versions of the LLN. We will concern ourselves here with the Weak Law of Large Numbers (the Strong Law is harder to prove but says something similar in spirit). Mathematically, we would state it as follows (don’t worry – we’ll break each piece of this down momentarily): Let $X_1, \dots, X_n$ be independent and identically distributed random variables all having finite average value $m$. Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any $\varepsilon > 0$, $P(|\bar X_n - m| < \varepsilon) \to 1$ as $n \to \infty$.

Let’s dissect this piece by piece and make sure it makes sense:

  • “Let $X_1, \dots, X_n$ be independent and identically distributed random variables”: This means that each observation $X_i$ is independent of all the others and is produced by the same distribution as all the others. Think of a black box that independently spits out a sequence of $n$ numbers using some fixed, unknown probability distribution. Each number the box spits out would be represented by an $X_i$. (The theorem is going to say what we can infer about the black box distribution’s average value once we’ve made a sufficiently large number of observations.)
  • “Let $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$”: $\bar X_n$ is the average value of the observations.
  • “Then for any $\varepsilon > 0$”: This is a math trick students usually first come across in real analysis. You pick some arbitrarily small value of $\varepsilon$ and then show that some quantity of interest can eventually be made smaller than the value you chose. (In our case, the quantity in question is the difference between the average value of the observations and the true average $m$.)
  • “$P(|\bar X_n - m| < \varepsilon) \to 1$ as $n \to \infty$”: The probability that the average value of the observations and the true mean of the underlying distribution differ by less than $\varepsilon$ gets closer and closer to 1 as you take more and more observations. In other words, the event that $\bar X_n$ lands within $\varepsilon$ of $m$ becomes a near certainty.

Read that again if it didn’t make sense. Once all that sinks in, go back and read the formulation; I hope you find that it’s a delightfully compact way of formalizing the intuition we started with. Our final act will be to prove it. If you haven’t heard of Chebyshev’s inequality, read this before attempting the proof.
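Before the proof, here’s a quick numerical sanity check of the statement itself. It’s just a sketch with choices I’ve made for illustration: each $X_i$ is a fair die roll (so $m = 3.5$) and $\varepsilon = 0.1$, and we estimate $P(|\bar X_n - m| < \varepsilon)$ by brute-force simulation for a few values of $n$.

```python
# Monte Carlo estimate of P(|X_bar_n - m| < eps) for a fair die (m = 3.5)
# and eps = 0.1, using a fixed number of repeated trials for each n.
import random

random.seed(2)
m, eps, trials = 3.5, 0.1, 2_000
for n in (10, 100, 1_000, 10_000):
    hits = 0
    for _ in range(trials):
        x_bar = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(x_bar - m) < eps:
            hits += 1
    print(f"n = {n:>5}: estimated P(|X_bar_n - 3.5| < 0.1) = {hits / trials:.3f}")
# The estimates climb toward 1 as n grows, exactly as the theorem promises.
```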

Proof

We will assume for this proof that the $X_i$ have finite variance $\sigma^2$, though this assumption is not necessary. (It’s just that if we don’t make that assumption, the proof gets more complicated.) First, let’s compute the mean and variance of $\bar X_n$. Because expectation is linear and $\bar X_n$ is just $\frac{1}{n}$ times a sum of RVs each having mean $m$, we have

\begin{align*} E(\bar X_n) &= \frac{1}{n}\sum_{i=1}^n E(X_i) \\\\ &= \frac{1}{n}\sum_{i=1}^n m \\\\ &= \frac{nm}{n} \\\\ &= m. \end{align*}

To compute the variance, we use the independence of the $X_i$ to write that

\begin{align*} \text{Var}(\bar X_n) &= \text{Var}(\frac{1}{n}\sum X_i) \\\\ &= \frac{1}{n^2}\text{Var}(\sum X_i) \\\\ &= \frac{1}{n^2}\sum \text{Var}(X_i) \\\\ &= \frac{1}{n^2}n\sigma^2 \\\\ &= \sigma^2/n. \end{align*}

Before we continue, note that the fact that this variance is a decreasing function of $n$ makes intuitive sense: the more observations I take, the less my sample average should fluctuate.
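If you’d like to see these two computations borne out numerically, here’s a small check under a distribution I’ve picked purely for illustration (uniform on $[0, 1]$, which has $m = 0.5$ and $\sigma^2 = 1/12$): generate many independent copies of $\bar X_n$ and look at their empirical mean and variance.

```python
# Empirical check that E(X_bar_n) = m and Var(X_bar_n) = sigma^2 / n for
# uniform [0, 1] observations (m = 0.5, sigma^2 = 1/12) with n = 50.
import random
import statistics

random.seed(3)
n, trials = 50, 20_000
sample_means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]
print("mean of X_bar_n:    ", round(statistics.fmean(sample_means), 4),
      "(theory: 0.5)")
print("variance of X_bar_n:", round(statistics.pvariance(sample_means), 6),
      "(theory: sigma^2 / n =", round((1 / 12) / n, 6), ")")
```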

Next, we use Chebyshev’s inequality with $\mu = m$, $k = \varepsilon$ and $X = \bar X_n$ to say that

\begin{align*} P(|\bar X_n - m| < \varepsilon) &= 1 - P(|\bar X_n - m| \geq \varepsilon) \\\\ &\geq 1 - \frac{\sigma^2/n}{\varepsilon^2} \\\\ &= 1 - \frac{\sigma^2}{n\varepsilon^2}. \end{align*}

(The inequality sign is flipped because we have that $1 - \dots$ in there. Otherwise, it’s just plug and play Chebyshev.)

As $n$ gets larger and larger, that rightmost term tends to 0, so the lower bound tends to 1. Since no probability can exceed 1, the probability of interest must itself tend to 1, and voilà! We’re done.
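To see how loose (or not) the bound is, here’s one last sketch, again with choices made purely for illustration: uniform $[0, 1]$ observations (so $m = 0.5$, $\sigma^2 = 1/12$) and $\varepsilon = 0.05$. We compare $1 - \frac{\sigma^2}{n\varepsilon^2}$ against a simulated estimate of $P(|\bar X_n - m| < \varepsilon)$.

```python
# Compare the Chebyshev lower bound 1 - sigma^2 / (n * eps^2) with a
# Monte Carlo estimate of P(|X_bar_n - m| < eps) for uniform [0, 1] data.
import random

random.seed(4)
m, var, eps, trials = 0.5, 1 / 12, 0.05, 5_000
for n in (100, 500, 2_000):
    hits = sum(abs(sum(random.random() for _ in range(n)) / n - m) < eps
               for _ in range(trials))
    bound = 1 - var / (n * eps * eps)
    print(f"n = {n:>4}: Chebyshev bound = {bound:.3f}, empirical = {hits / trials:.3f}")
```

The empirical probability sits at or above the (often rather loose) bound, and both head toward 1 as $n$ grows, which is all the proof needs.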

Conclusion

This law is intuitive, deep, and deeply embedded in the way that we, as human beings trying to navigate the world, deal with uncertainty, so I thought it deserved a post.

Happy New Year to all!
