The overfitting problem and the bias vs. variance dilemma

We have seen what linear regression is, how to build models and algorithms for estimating the parameters of such models, and how to measure the loss.

Now we will see how to assess how well a given method should perform in predicting new data, and how to select the best-performing model among the candidates.

We will first explore the concepts of training error and test error, how they vary with model complexity and how they can be used to form a valid assessment of predictive performance. This leads directly to the bias-variance tradeoff, which is fundamental to machine learning.

The concepts described in this post are key to all machine learning problems, well beyond the regression setting.

Assessing performance: measures of loss

As we know from the statistician George Box:

 Remember that all models are wrong; but some are useful.

The practical question is how wrong they have to be before they stop being useful.

We have seen previously how we can define a measure of loss, something that tells us how big the error is that our model makes when predicting on the historical data, i.e. the data we used to train the model. This is called the training error or in-sample error.

What the loss function specifies is the cost incurred when the true observation is y and we instead make some other prediction \hat{y}. For this reason the loss function is also called the cost function.

When you train a model f, you estimate the model parameters \beta.
They are used to make predictions: \hat{y} = f_{\beta}(X)

\hat{y} is our predicted value at some input X and y is the true value.

Therefore, the loss function, L, is measuring the difference
between these two values. For example, the absolute difference:

L = \left |y - \hat{y}\right |

or the squared error:

L = (y - \hat{y})^{2}


We have, in fact, already seen how to use the squared error as such a loss function.
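As a tiny illustration (a sketch with made-up numbers, not taken from the dataset), here is how the two losses compare for a single prediction:

import numpy as np

y_true = 250000   # hypothetical true price
y_hat = 230000    # hypothetical predicted price

l1_loss = np.abs(y_true - y_hat)   # absolute difference: 20000
l2_loss = (y_true - y_hat) ** 2    # squared error: 400000000
print(l1_loss, l2_loss)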

An example: house sales data

You can follow along with this Jupyter notebook on GitHub.
The dataset comes from house sales in King County, the region where the city of Seattle is located.

First we will load the data into a Pandas data frame:

import pandas as pd
sales = pd.read_csv('kc_house_data.csv')
sales.shape
(21613, 21)
m = sales.shape[0] # number of training examples

The dataset contains information (21 columns, including the price) on 21613 houses. Our target variable (i.e., what we want to predict when a new house goes on sale) is the price.

Baseline: the simplest model

Now let's compute the loss for the simplest possible model: a fixed price equal to the average of the historic prices, independently of house size, rooms, location, …

y = sales['price'] # extract the price column
avg_price = y.mean()  # this is our baseline
print ("average price: ${:.0f} ".format(avg_price))
average price: $542800

The predictions are very easy to calculate, just the baseline value:

def get_baseline_predictions():
    # Simplest version: return the baseline as the predicted value
    predicted_values = avg_price
    return predicted_values

  # an example
my_house_sqft = 2500
estimated_price = get_baseline_predictions()
print ("The estimated price for a house with {} square feet is {:.0f}".format(my_house_sqft, estimated_price))
The estimated price for a house with 2500 square feet is 542800

If we plot the training data points (price vs. size), we can see that the baseline is a horizontal straight line, and the training error is given by all the vertical differences between the data points and the baseline:

[Figure: The simplest model]
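A minimal sketch of how such a chart could be produced (assuming matplotlib is available; sales, y and avg_price are the objects defined above):

import matplotlib.pyplot as plt

plt.scatter(sales['sqft_living'], y, s=2, label='training data')
plt.axhline(avg_price, color='red', label='baseline (average price)')
plt.xlabel('Size (sqft of living space)')
plt.ylabel('Price ($)')
plt.legend()
plt.show()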

The measure of loss, the RSS, on the training data for this simplest model is:

import numpy as np

def get_loss(yhat, target):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    target -- vector of size m (true labels)

    Returns:
    loss -- the value of the L2 loss function
    """
    # compute the residuals (since we are squaring it doesn't matter
    # which order you subtract)
    # np.dot will square the residuals and add them up
    loss = np.dot((target - yhat), (target - yhat))

    return loss

baselineCost = get_loss(get_baseline_predictions(), y)

print ("Training Error for baseline RSS: {:.0f}".format(baselineCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(baselineCost/m)))
Training Error for baseline RSS: 2912916761921302
Average Training Error for baseline RMSE: 367119

So this is the simplest model we could consider, and you can see that the training error is pretty significant. On average, the error is more than 367K dollars per house!

Putting the model complexity together with the training error in another chart, it would look like this for the baseline model:

[Figure: The baseline model has high error]

Now, we can look at how training error behaves as model complexity increases.

Learning a better but still simple model

Using a constant value, the average, is easy but does not make much sense.
Let's create a linear model with the house size as the feature. We expect the price to depend on the size: the bigger the house, the more expensive.
We can use the sklearn linear_model module that we saw already.

from sklearn import linear_model
simple_model = linear_model.LinearRegression()
simple_features = sales[['sqft_living']] # input X
simple_model.fit(simple_features, y)
  # Now that we have fit the model we can extract
  # the regression weights (coefficients) as follows:
simple_model_intercept = simple_model.intercept_
print (simple_model_intercept)
-43580.7430945
simple_model_weights = simple_model.coef_
print (simple_model_weights)
[ 280.6235679]

This means that the model looks like:
y = -43581 + 281 * x
where x is the size in square feet. It is no longer a horizontal line but a sloped one:

[Figure: The simple one-feature linear model]

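As a quick check (a sketch reusing the fitted simple_model and the 2500-square-foot example house from before), the learned weights give:

example_house = pd.DataFrame({'sqft_living': [2500]})
print (simple_model.predict(example_house))   # prediction from the fitted model
print (-43581 + 281 * 2500)                   # same computation with the rounded weights: 658919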
Now that we can make predictions given the model, let’s again compute the RSS and the RMSE.

    # First get the predictions using the features subset
predictions = simple_model.predict(sales[['sqft_living']])
simpleCost = get_loss(predictions, y)
print ("Training Error for baseline RSS: {:.0f}".format(simpleCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(simpleCost/m)))
Training Error for baseline RSS: 1477276362322490
Average Training Error for baseline RMSE: 261441

And you see that the training error has significantly decreased.

[Figure: The second model has much smaller error but slightly higher complexity]

We can add more features to the model, for example the number of bedrooms and bathrooms:

more_features = sales[['sqft_living', 'bedrooms', 'bathrooms']] # input X
better_model = linear_model.LinearRegression()
better_model.fit(more_features, y)

betterModel_intercept = better_model.intercept_
print (betterModel_intercept)
74847.1408013

betterModel_weights = better_model.coef_
print (betterModel_weights)
[   309.39239013 -57860.8943206    7932.71222266]

betterRSS = get_loss(better_model.predict(more_features), y)
print (betterRSS)
1436301587370053.5
print (betterRSS / simpleCost)
0.9722632975132545

The model is now different, with three variables:
y = 74847 + 309 * x1 - 57861 * x2 + 7933 * x3
where x1 is the size, x2 the number of bedrooms and x3 the number of bathrooms.

predictions = better_model.predict(sales[['sqft_living', 'bedrooms', 'bathrooms']])
betterCost = get_loss(predictions, y)
print ("Training Error for baseline RSS: {:.0f}".format(betterCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(betterCost/m)))

It is again lower, but this time not by much.

[Figure: Less error means higher complexity]

We can go on adding more features, such as the number of floors.
Also, although we often think of multiple regression as including multiple different features, we can also consider transformations of existing features, e.g. the log of the square feet, polynomial features such as the square, or even “interaction” features such as the product of bedrooms and bathrooms:

  • bedrooms_squared = bedrooms*bedrooms
  • bed_bath_rooms = bedrooms*bathrooms
  • log_sqft_living = log(sqft_living)
  • lat_plus_long = lat + long

Some explanation:

  • Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
  • Bedrooms times bathrooms gives what’s called an “interaction” feature. It is large when both of them are large.
  • Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.
  • Adding latitude to longitude is totally non-sensical but we will do it anyway (you’ll see why).

You can see the model fit to these features and more in the Python notebook.
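A minimal sketch of how these columns could be added to the data frame (assuming the lat and long columns present in the King County data, with np being the NumPy alias imported earlier):

sales['bedrooms_squared'] = sales['bedrooms'] ** 2
sales['bed_bath_rooms'] = sales['bedrooms'] * sales['bathrooms']
sales['log_sqft_living'] = np.log(sales['sqft_living'])
sales['lat_plus_long'] = sales['lat'] + sales['long']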

We can also fit a quadratic function and again the training error goes down; as we increase the model complexity to higher-order polynomials, the training error becomes very low.

[Figure: A high-order polynomial function would fit well]
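One way to fit such polynomial models, as a sketch, is with scikit-learn's PolynomialFeatures in a pipeline (the degree here is just an illustrative choice):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

  # quadratic fit on the living area; higher degrees push the training error even lower
quad_model = make_pipeline(PolynomialFeatures(degree=2), linear_model.LinearRegression())
quad_model.fit(sales[['sqft_living']], y)
print (get_loss(quad_model.predict(sales[['sqft_living']]), y))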

Training error decreases quite significantly with model complexity. This is quite intuitive: the model is fit on the training points, so as we increase the model complexity we are better and better able to fit those training data points.

[Figure: Training error vs. model complexity]

The test error

A natural question is whether the training error is a good measure of predictive performance.

The issue is that the training error is overly optimistic, because the beta parameters were fit on the training data precisely to minimise the residual sum of squares, which is closely related to the training error.
So, in general, a small training error does not imply good predictive performance (unless the training data set is really representative of everything you might see out there in the world).

This is intuitive. Suppose that we develop an algorithm to predict a stock’s price based on previous stock returns from the past 6 months. But we don’t really care how well our method predicts last week’s stock price. We instead care about how well it will predict tomorrow’s price or next month’s price.

This takes us to something called the test error (or out-of-sample error): we hold out some houses from the data set and put them into what is called a test set.
When we fit our models, we fit them on the training data set only.
Then, when we assess the performance of a model (i.e. when we compute the loss), we look at the houses in the test set, which hopefully serve as a proxy for everything out there in the world.

Bottom line, the test error serves as a (noisy) approximation of the true error.


Split data into training and testing data

Let's see how this can be applied to our example.
First we split the data into a training set and a testing set.
The training set will be used to fit the regression model, while the testing set will be used as a control group.

We can use a handy function from sklearn, the train_test_split():

from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(sales, test_size=0.3, random_state=999)

train_data.shape
(17290, 21)
test_data.shape
(4323, 21)

It needs at least the original data and the fraction that should go into the testing set (a number from 0 to 1). In this case the testing set will be 30% (therefore the training set is 70% of the original data), and we use a random seed for reproducibility.

This raises the question: how big should the test set be?
If I put too few points in my training set, then I am not going to estimate my model well. But on the other hand, if I put too few points in my test set, it will be a bad approximation of the general error, because it will not represent a wide enough range of things I might see out there in the world.
There is no perfect formula for how to split a data set into training versus test, but a general rule of thumb is that you want just enough points in your test set to form a reasonable estimate of the general error.
If this leaves too few points for the training set, other methods like cross-validation can be used and we will see them in later posts.

Now we will learn the weights for five new (nested) models. The first model will have the fewest features, the second model will add more, and so on:

Model 1: square feet, # bedrooms, # bathrooms, latitude & longitude, lot size, floors.
Model 2: add bedrooms*bathrooms, log of square feet, bedrooms squared
Model 3: add the (nonsensical) latitude + longitude
Model 4: add bedrooms^4 and bathrooms^7
Model 5: add size^3
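As a rough sketch of how these nested models might be set up (the first three feature lists reuse the engineered columns created earlier; the higher powers for models 4 and 5 are hypothetical column names computed here on the training and testing frames):

  # hypothetical extra transformed columns for models 4 and 5
for df in (train_data, test_data):
    df['bedrooms_4'] = df['bedrooms'] ** 4
    df['bathrooms_7'] = df['bathrooms'] ** 7
    df['sqft_living_cubed'] = df['sqft_living'] ** 3

model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'sqft_lot', 'floors']
model_2_features = model_1_features + ['bed_bath_rooms', 'log_sqft_living', 'bedrooms_squared']
model_3_features = model_2_features + ['lat_plus_long']
model_4_features = model_3_features + ['bedrooms_4', 'bathrooms_7']
model_5_features = model_4_features + ['sqft_living_cubed']

model_1 = linear_model.LinearRegression()
model_2 = linear_model.LinearRegression()
model_3 = linear_model.LinearRegression()
model_4 = linear_model.LinearRegression()
model_5 = linear_model.LinearRegression()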

You can see the full code in the Python notebook; here are the results:

  # extract the price column
train_y = train_data.price
test_y = test_data.price
  # train the models on training data only:
model_1.fit(train_data[model_1_features], train_y)
model_2.fit(train_data[model_2_features], train_y)
model_3.fit(train_data[model_3_features], train_y)
model_4.fit(train_data[model_4_features], train_y)
model_5.fit(train_data[model_5_features], train_y)
# Compute the RSS on TRAINING data for each of the five models and record the values:
print (get_loss(model_1.predict(train_data[model_1_features]), train_y))
...

8.60021288255e+14
7.89347967237e+14
7.89347967237e+14
7.73197162481e+14
7.50061625701e+14
# Compute the RSS on TESTING data for each of the five models and record the values:
print (get_loss(model_1.predict(test_data[model_1_features]), test_y))
...
3.33694184144e+14
3.31182631706e+14
3.31182631706e+14
3.49204952637e+14
4.05978695042e+14

A couple of interesting observations:

  • The most complex model has the lowest error on the training data.
  • Every nested model decreases the training error, except…
  • …the model with the non-sensical feature, lat plus long, which does not improve it.
  • The testing error also improves when adding the interaction features and remains the same with the non-sensical feature, but it gets worse when the model becomes more complex with the added polynomial features, which evidently do not generalise well.

Overfitting

When you have too many features in a model and the learned hypothesis fits the training set very well but fails to generalise to new data (e.g. predicting prices of new houses), this is called the overfitting problem.

Formally, a model, let's say Model 1 with estimated parameters beta_1, overfits if there exists another model, let's call it Model 2 with estimated parameters beta_2, such that the training error of Model 2 is less than the training error of Model 1 but, on the other hand, the true general error of Model 2 is greater than the true error of Model 1.
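In symbols, writing error_{train} for the training error and error_{true} for the true generalisation error:

error_{train}(\beta_{2}) < error_{train}(\beta_{1}) \quad \text{while} \quad error_{true}(\beta_{2}) > error_{true}(\beta_{1})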

[Figure: The overfitting models]

From the picture above you can see that the models prone to overfitting are the ones with small training error and high complexity.
Therefore one simple way to avoid overfitting is to prefer simpler models and avoid complex models with many features.
Another way is to carefully select which features to use, preferring the ones with a high correlation with the target.
But we will see other, more sophisticated ways to avoid overfitting (regularisation) in future posts.

Signal and noise

Why does overfitting happen?
Data have two parts: signal and noise.
The goal of a predictive model is to find the meaningful signal in the data. We can always design a perfect in-sample predictor, but in doing so we capture both the signal and the noise, and that is unavoidable.

Data are inherently noisy.
For example, there’s some true relationship between size and the value of a house (or generically, between x and y).
And we can represent that relationship by a linear regression model, as we have seen at the beginning, with the simple model.
But of course that is not a perfect description. There are lots of other contributing factors that are not included, such as how a person feels on the purchase day or a personal relationship the buyer might have with the owners.
That is the noise, and it is just a property of the data: we have no control over it. It has nothing to do with our model or our estimation procedure; it is simply something we have to deal with.

This is called the irreducible error, because it cannot be reduced by choosing a better model.
This brings us to two other important sources of error that we can control: the bias and the variance.

The bias

The bias is basically an assessment of how well the model can fit the true relationship between x and y.
Initially we had a data set recording houses that were sold, with their features and prices. Based on that data set we fitted some models. But what if a different set of houses had been sold? Then we would have had a different data set, and when we went to fit our model we would have obtained a different regression line.

It could be very different or very similar. Over all those possible fits, we can imagine a line representing our average fit, averaged over all those data sets weighted by how likely they were.

Bias is the difference between this average fit and the true function describing the relationship. Intuitively, it shows whether our model is flexible enough to capture, on average, the true relationship.
A very simple model, such as our baseline constant line, is a low-complexity model with high bias: it is not flexible enough to approximate the true relationship well, and this leads to bias errors in our predictions.

The variance

Different training data sets will result in a different fit.
Variance refers to the amount by which a model would change if we fitted it using a different training data set.

How much can the fits vary?
Ideally the fit should not vary too much between training sets. However, if a method has high variance, then small changes in the training data can result in large changes in the fit, and we would get very erratic predictions. The prediction would be sensitive to which data set we happen to get. That is a source of error in the predictions: the variance error.

In general, more flexible statistical methods, i.e. high-complexity models, have higher variance. They can match the true relationship between size and house value very closely, because the model is really flexible, so these high-complexity models have low bias, but high variance.
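As a rough illustrative experiment (a sketch, not part of the original notebook), we can refit the one-feature model on a few random subsamples of the training data and look at how much the learned slope moves around; the spread of the slopes is a crude proxy for the variance:

  # fit the simple one-feature model on several random subsamples
for seed in range(5):
    subsample = train_data.sample(n=1000, random_state=seed)
    m_sub = linear_model.LinearRegression()
    m_sub.fit(subsample[['sqft_living']], subsample['price'])
    print (m_sub.coef_[0])   # the slope changes from subsample to subsample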

You can think of this analogy: a judge has high bias if they tend to always give a similar sentence, regardless of the differences between court cases. The judge has high variance instead if the sentences are very different or erratic even when the court cases are similar. Of course, a good judge has low bias and low variance.

The bias – variance trade-off

If we plot the bias and variance as a function of model complexity, we can see that as the model complexity increases, the bias decreases, because we can better and better approximate the true relationship between x and y.

On the other hand, the variance increases. Our very simple model had very low variance and the higher-complexity models had higher variance.

What we see is that there is a natural tradeoff between bias and variance.
Good performance of a machine learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with low bias but high variance (by drawing a curve that passes through every single training observation) or a method with low variance but high bias (by fitting a horizontal line to the data).
The challenge lies in finding a method for which both the variance and the squared bias are low.  And machine learning is all about this tradeoff between bias and variance.

The goal is to find the sweet spot where we get the minimum error, i.e. the minimum combined contribution of bias and variance to the prediction errors. That is the model complexity we want.
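For the squared-error loss, this is often summarised by the classical decomposition of the expected prediction error, where \sigma^{2} denotes the irreducible noise:

E\left[(y - \hat{y})^{2}\right] = \text{Bias}\left[\hat{y}\right]^{2} + \text{Var}\left[\hat{y}\right] + \sigma^{2}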

[Figure: Bias and variance as a function of model complexity]

But here is the catch: we cannot compute the bias and the variance.
In a real-life situation it is not possible to explicitly compute the true test MSE or RMSE, the bias, or the variance of a machine learning method, because they are defined in terms of the true function. Bias is defined explicitly relative to the true function, and to get the variance we would have to average over all possible data sets, which we simply do not know.

Nevertheless, one should always keep the bias-variance trade-off in mind.
And there are ways to optimise this trade-off in a practical way, using cross-validation and regularisation (next post).

Note: this post is part of a series about Machine Learning with Python.
