Overfitting and Regularization
Overfitting is “fitting the data more than is warranted.” It occurs when you select a hypothesis that fits the training data better but performs worse on the test data than an alternative would.

(Images in this section are from Abu-Mostafa et al., “Learning from Data.”)
In this image, the blue curve is the actual target, the points are training data drawn from that curve with noise added, and the red curve is a 4th degree polynomial that perfectly fits the training data. The error of the red curve on the training set is zero, but if we generate any more points from the blue curve, the error of the red curve is likely to be very large.
Overfitting can have a few related causes:
Too many parameters or too powerful a model compared to the size of the data. The parameters memorize the data, which is part of what is happening in the image above.
Fitting to the noise in the input.
Correlation in the training input. Essentially, if you accidentally train on examples that are too similar to one another, the learning algorithm will generalize in ways that don’t reflect the rest of the data.
You might ask, “Aren’t points 1 and 2 the same? If there hadn’t been noise, wouldn’t we have learned the function correctly?” Let’s look at overfitting without noise:

Here, the data was generated from a 50th order function without noise. The red curve is a 10th order fit, and the green curve is a second order fit. While not perfect, the 2nd order is significantly better. In fact, the 2nd order has a 0.029 error on the training set and 0.120 error on the test set, while the 10th order has a 0 error on the training set and a 7680 error on the test set.
We can often fix the issue of input correlation by selecting better data. We can randomly select our training data from the overall data pool multiple times and try learning on each of those sets; this is known as cross-validation.
Finally, you may wonder how linear regression can overfit if it is just a line. That is true for simple lines, but remember that we might be using high-dimensional features. Even though we are finding a line in the feature space, that maps to a potentially very complex curve in the original space. In this case, it is not the parameters that are the problem, but the features! If we have too many complex features, each with their own parameter, we still get overfitting.
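As a rough illustration, here is a minimal numpy sketch (the helper name `poly_features` and the synthetic data are invented for this example): a plain least-squares fit is well behaved with a few polynomial features, but with enough features to match the number of points it can drive the training error to nearly zero while the test error blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_features(x, degree):
    """Expand a 1-D input into columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def target(x):
    """A smooth target function used to generate the data."""
    return np.sin(2 * np.pi * x)

# Few noisy training points, many clean test points.
x_train = rng.uniform(0, 1, 15)
y_train = target(x_train) + rng.normal(0, 0.1, 15)
x_test = np.linspace(0, 1, 200)
y_test = target(x_test)

for degree in (2, 14):
    Xtr = poly_features(x_train, degree)
    Xte = poly_features(x_test, degree)
    # Ordinary least squares: the model is still linear in its features.
    w, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    train_mse = np.mean((Xtr @ w - y_train) ** 2)
    test_mse = np.mean((Xte @ w - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```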
Regularization
What can we do about overfitting? There are many approaches, including the obvious: use simpler models. However, that is harder than it sounds. To paraphrase Einstein, every model should be as simple as possible, but no simpler.
Overfit solutions are characterized by large, rapidly changing gradients, meaning the lines of best fit oscillate wildly up and down. A solution without these large gradients will perform better. Suppose we have a prediction based off a data matrix $X$ with two feature columns $x_1$ and $x_2$, as follows:

$$\hat{y} = w_1 x_1 + w_2 x_2$$

The partial derivatives $\partial \hat{y} / \partial x_1$ and $\partial \hat{y} / \partial x_2$ are equal to $w_1$ and $w_2$, respectively. If the weights are small, the gradient will be small. What we want is a solution that fits the data well and has small weights. This is known as weight decay, because we are encouraging weights to decay towards zero.
There are a number of ways to define how “small” a vector is. The most common is the sum of squares of the weights, i.e., $\sum_j w_j^2$, or equivalently the dot product of $w$ with itself, $w^\top w$. This is called the (squared) $L_2$ norm of the vector. Just as we punish a $w$ that produces a large error when approximating $y$ (using $\|Xw - y\|^2$), we are also going to punish a $w$ that has a large norm:

$$\min_w \; \|Xw - y\|^2 + \lambda\, w^\top w$$
The $\lambda$ is a tradeoff parameter—a simple scalar—that helps us decide how much we want to prioritize a good fit versus a small weight vector. As with unregularized linear regression, we take the derivative and set it equal to zero, which ends up with a pleasingly simple result:

$$w = (X^\top X + \lambda I)^{-1} X^\top y$$

where $I$ is the identity matrix. Using this solution is called linear regression with $L_2$ regularization, or simply ridge regression.
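Here is a minimal sketch of that closed-form solution in numpy; the function name `ridge_fit` and the synthetic data are invented for this example. Notice how larger values of $\lambda$ shrink the norm of the learned weights.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic demo data; lam = 0 recovers ordinary least squares.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, 20)

for lam in (0.0, 0.1, 10.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:5.1f}   ||w||^2 = {w @ w:.3f}")
```

In practice the bias (intercept) term is usually left out of the penalty, and solving the linear system directly, as above, is preferred over explicitly forming the matrix inverse.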
Normalization
With the introduction of regularization, normalization or standardization becomes very important. Consider two features that record the same measurement in different units, say a distance in meters ($x_1$) and the same distance in kilometers ($x_2 = x_1 / 1000$).
Clearly, these two features carry the same information. However, one is at a much larger scale than the other. Because regularization encourages weights to be small, a weight associated with the first feature can be smaller than a weight associated with the second feature while expressing the same thing. We are accidentally biasing the model toward favoring the first feature over the second rather than letting the data decide.
For this reason, it is common to normalize or standardize your data before fitting a regularized model. These terms essentially mean adjusting all features to be on the same scale; standardization additionally centers each feature at mean zero.
Normalize usually means to divide so the vector is in the range 0 to 1. This is min-max scaling:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
Standardize usually means to divide so the standard deviations of all vectors are 1:

$$x' = \frac{x - \bar{x}}{\sigma_x}$$
When we say “the vector,” we mean a column of the data matrix (a feature), not a row; if we rescaled rows instead, the relative scale between the two dimensions would remain the same. Remember to adjust your test set in the same way, using the statistics computed on the training set!
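Here is a small numpy sketch of both transformations, applied per column (the helper names are invented for this note); the important detail is that the test set is transformed with the training set’s statistics.

```python
import numpy as np

def minmax_params(X):
    """Per-column min and max computed on the training set."""
    return X.min(axis=0), X.max(axis=0)

def minmax_apply(X, lo, hi):
    return (X - lo) / (hi - lo)

def standardize_params(X):
    """Per-column mean and standard deviation computed on the training set."""
    return X.mean(axis=0), X.std(axis=0)

def standardize_apply(X, mu, sigma):
    return (X - mu) / sigma

# Three features on wildly different scales.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 3)) * np.array([1.0, 1000.0, 0.01])
X_test = rng.normal(size=(20, 3)) * np.array([1.0, 1000.0, 0.01])

mu, sigma = standardize_params(X_train)
X_train_std = standardize_apply(X_train, mu, sigma)
X_test_std = standardize_apply(X_test, mu, sigma)  # training statistics, not the test set's own
print(X_train_std.std(axis=0))  # roughly 1 for every column
```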
Finding your $\lambda$
How do we decide on $\lambda$? There are two possible ways:
Split your data into training and testing sets, then simply perform ridge regression with differing values of $\lambda$, keeping the one that performs best on the test set.
Cross-validation: You may find that for different random choices of test set, you get a different best value of $\lambda$; this happens especially with small datasets. Instead, split the data into several folds, hold each fold out in turn as a validation set while training on the rest, and pick the $\lambda$ with the lowest average held-out error. This makes better use of small amounts of data, as you get to train on almost everything, though it is obviously much more computationally expensive.
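Below is a minimal sketch of that procedure in numpy, redefining the hypothetical `ridge_fit` from the earlier sketch so the block is self-contained; the fold count, the $\lambda$ grid, and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_mse(X, y, lam, n_folds=5, seed=0):
    """Average held-out mean squared error of ridge regression across folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Pick the lambda with the lowest average validation error on synthetic data.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 40)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best = min(scores, key=scores.get)
print("best lambda:", best)
```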