Introduction
In statistics, the overarching goal – in some sense – is to figure out how to take a limited amount of data and reliably make inferences about the broader population we don’t have data about. To do this, we study, develop, and use mathematical objects called statistics or estimators. Formally, these are functions of other objects called random variables, but, for the moment, it suffices to think of them as ways of using limited amounts of data furnished by a part to learn about the whole.
As an example, let’s say that you wanted to find out the average height of all humans on earth, and let’s further suppose that the actual average human height is 3.5 feet. You might take a random sample of 1000 people, add up all of their heights, and divide by 1000. The number you get from that procedure, the mean height of your sample, is an estimator of the true average height; let’s call it $A$. Alternatively, let’s say you took that same sample of 1000 people, disregarded it entirely, and decided that your estimator of the mean population height was going to be zero feet, zero inches; call this (rather silly) estimator $B$.
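To make the two estimators concrete, here’s a minimal sketch in Python. The normal distribution and the 0.5-foot standard deviation are assumptions made purely for illustration; only the 3.5-foot true mean comes from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MEAN = 3.5  # the example's hypothetical true average height, in feet

# Assumed population for illustration: normal heights with a 0.5-foot spread.
sample = rng.normal(loc=TRUE_MEAN, scale=0.5, size=1000)  # 1000 sampled heights

estimator_a = sample.mean()  # estimator A: the mean of your sample
estimator_b = 0.0            # estimator B: ignore the data, always guess zero

print(f"A (sample mean): {estimator_a:.3f} feet")
print(f"B (always zero): {estimator_b:.3f} feet")
```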
Before developing any formal measure of the quality of an estimator, think about the above two estimators. Does one seem “more reasonable” than the other? For any parameter you want to estimate using some data you’ve collected, there are infinitely many estimators you could come up with; one natural concern we might seek to address is how to mathematically distinguish quality estimators from useless estimators.
Mean squared error = bias² + variance
Mean squared error (henceforth MSE) is an attempt to formally capture the difference in the quality of different estimators. It is defined as the expected value of the square of the distance between the estimator’s value and the true value of the parameter you are trying to estimate with it. In the example above:
The true value of the parameter (average human height) is 3.5 feet. The estimator is the sample average of 1000 people that you calculated. Because the estimator is a function of random variables (the heights of the people you sampled), it, too, is a random variable, say $X$, with some distribution. We can therefore think about the expected value of some function of $X$ – in our case, the function is $f(X) = (X - 3.5)^2$. To compute the expected value, you take a weighted average of all the possible values $X$ can take, weighting each one by the probability of seeing that outcome. (Don’t worry too much about why we’re squaring; it makes the calculus easier.)
Mathematically, we write

$$\mathrm{MSE}(\hat y) = \mathbb{E}\big[(\hat y - y)^2\big],$$

where $\hat y$ is the estimator and $y$ is the true parameter value.
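To see the definition in action, here’s a sketch that approximates the MSE of both estimators by simulation: draw many samples of 1000 heights, compute the estimator for each sample, and average the squared distance to the true value. As before, the normal population with a 0.5-foot spread is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MEAN = 3.5        # true parameter value y
N_EXPERIMENTS = 10_000  # number of simulated sampling experiments
SAMPLE_SIZE = 1000

# Each row is one experiment: a sample of 1000 heights.
samples = rng.normal(loc=TRUE_MEAN, scale=0.5, size=(N_EXPERIMENTS, SAMPLE_SIZE))
y_hat = samples.mean(axis=1)  # estimator A, computed once per experiment

mse_a = np.mean((y_hat - TRUE_MEAN) ** 2)  # MSE of estimator A
mse_b = (0.0 - TRUE_MEAN) ** 2             # MSE of estimator B (a constant)

print(f"MSE of A ≈ {mse_a:.6f}")  # roughly 0.5**2 / 1000 = 0.00025
print(f"MSE of B = {mse_b:.2f}")  # (0 - 3.5)^2 = 12.25
```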
Intuitively, if we expect the estimator to stray far from the true value of the parameter on average, it’s probably not a very good estimator. Conversely, if that expected value is close to 0, the estimator deviates very little from the actual parameter, i.e. it’s a great estimator! With some tedious algebra, we can actually show that

$$\mathrm{MSE}(\hat y) = \underbrace{\big(\mathbb{E}[\hat y] - y\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\hat y)}_{\text{variance}}.$$
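(If you want to see the algebra, one way it can go is to add and subtract $\mathbb{E}[\hat y]$ inside the square and expand; the cross term vanishes because $y$ is a constant and $\mathbb{E}\big[\hat y - \mathbb{E}[\hat y]\big] = 0$.)

$$\begin{aligned}
\mathbb{E}\big[(\hat y - y)^2\big]
&= \mathbb{E}\Big[\big((\hat y - \mathbb{E}[\hat y]) + (\mathbb{E}[\hat y] - y)\big)^2\Big] \\
&= \mathbb{E}\big[(\hat y - \mathbb{E}[\hat y])^2\big] + 2\big(\mathbb{E}[\hat y] - y\big)\,\mathbb{E}\big[\hat y - \mathbb{E}[\hat y]\big] + \big(\mathbb{E}[\hat y] - y\big)^2 \\
&= \mathrm{Var}(\hat y) + \big(\mathbb{E}[\hat y] - y\big)^2.
\end{aligned}$$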
Bias and variance are two very important qualities of estimators that help us understand how they relate to the real value you’re attempting to estimate:
Bias
Bias tells you the difference between the expected value of your estimator and the actual value of the parameter. To get an intuitive grasp of this, imagine throwing several darts at a board, all of which strike the board close together but off the true center. Your throws had high bias; in some sense, your “average” throw’s position was consistent, but generally off the mark.
Variance
Variance tells you how much the estimator tends to move around its expected value. If the estimator (which is just a function) takes values spread widely around its mean, the variance will be high. Suppose you know two basketball players, both of whom average 15 points per game. One of them scores 30 one game and 0 the next, while the other scores around 15 consistently. While both players have the same scoring average, one of their scoring patterns is high variance – i.e. it deviates pretty far from the mean – and the other is low variance.
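To tie both definitions back to the running example, here’s a sketch (using the same assumed normal population as before) that estimates the bias and variance of $A$ and $B$ by simulation and checks that MSE ≈ bias² + variance for each: $A$ is essentially unbiased with a little variance, while $B$ has zero variance and a huge bias.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MEAN = 3.5  # the example's hypothetical true average height
N_EXPERIMENTS, SAMPLE_SIZE = 10_000, 1000

samples = rng.normal(loc=TRUE_MEAN, scale=0.5, size=(N_EXPERIMENTS, SAMPLE_SIZE))
a = samples.mean(axis=1)       # estimator A, once per experiment
b = np.zeros(N_EXPERIMENTS)    # estimator B: always zero

for name, est in [("A", a), ("B", b)]:
    bias = est.mean() - TRUE_MEAN
    variance = est.var()
    mse = np.mean((est - TRUE_MEAN) ** 2)
    print(f"{name}: bias^2 = {bias**2:.6f}, variance = {variance:.6f}, "
          f"MSE = {mse:.6f} ≈ bias^2 + variance = {bias**2 + variance:.6f}")
```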
With this decomposition in mind, we can see that if we choose to use MSE as our metric of estimator quality, it decomposes very nicely into two intuitively appealing sources of error. Therefore, to lower MSE, we need to reduce bias, reduce variance, or both. It isn’t always so simple, though: reducing one can force you to raise the other, hence the name, the bias-variance tradeoff. There are a number of techniques, quite useful in practice, that leverage this decomposition of MSE into bias and variance.
In algorithms
For example, random forests are estimators that average the opinions of a bunch of high-variance (i.e. overfit, poorly generalizing) decision trees so that, in aggregate, they function as a lower-variance estimator. (This technique, a special case of the more general class of bagging algorithms, uses the fact that the variance of an average of $n$ independent estimators decreases as $n$ gets bigger.)
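Here’s a sketch of just that parenthetical fact – no actual trees, only a stand-in for an unbiased but high-variance base estimator (assumed normal noise with standard deviation 1) – showing that the variance of the average shrinks roughly like $1/n$.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MEAN = 3.5
N_EXPERIMENTS = 10_000

def averaged_estimate(n_estimators):
    # n_estimators independent, noisy, unbiased guesses per experiment,
    # averaged together (the stand-in for averaging n independent trees).
    guesses = rng.normal(loc=TRUE_MEAN, scale=1.0,
                         size=(N_EXPERIMENTS, n_estimators))
    return guesses.mean(axis=1)

for n in [1, 10, 100]:
    ensemble = averaged_estimate(n)
    # variance shrinks roughly like 1/n: ~1.0, ~0.1, ~0.01
    print(f"n = {n:3d}: variance of the averaged estimator ≈ {ensemble.var():.4f}")
```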
Attacking the MSE problem from the bias side, we have boosting algorithms, wherein the decision trees (or any other base estimators) are trained in sequence. Each tree in the sequence fits itself more tightly to the examples that the previous trees predicted incorrectly. In this sense, you are starting with a silly, low-variance estimator and gradually fitting it more tightly to the data, reducing bias, so that the sequence, taken all together, becomes more useful.
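Here’s a sketch of one common flavor of this idea – gradient boosting with squared error, where each small tree is fit to the residuals of the ensemble so far (other variants, like AdaBoost, reweight examples instead). The toy data, scikit-learn’s DecisionTreeRegressor, and the hyperparameters are all choices made just for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy regression problem: a noisy sine curve.
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

prediction = np.zeros_like(y)  # the initial "silly" estimator: always predict 0
learning_rate = 0.1

for step in range(100):
    residual = y - prediction                          # what we still get wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)      # nudge toward the data
    if step in (0, 9, 99):
        # Training MSE (and hence bias on the training data) drops as trees are added.
        print(f"trees: {step + 1:3d}, training MSE: {np.mean((y - prediction) ** 2):.4f}")
```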
Conclusion
While I’ve admittedly left out some of the technical details of bagging and boosting, the point is to illustrate that something abstract-seeming like MSE can be understood in a very concrete way, and that the decomposition we discussed leads to practical problem-solving approaches that are genuinely useful.