# Cross-validation

We have seen previously how learning the parameters of a prediction function on the same data would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set.

The test error would be then the average error that results when predicting the response on a new observation, one that was not used in training the learning method.
In contrast, the training error is calculated by applying the learning method to the observations used in its training.
But the training error rate often is quite different from the test error rate, and in particular it can dramatically underestimate the general error.

The best solution would be to use a large designated test set, that often is not available.

Here we see a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the learning method to those held out observations.

# An example with automobile data

I will use the dataset “auto”  from the UCI Machine Learning Repository.

Specifically, we will use the feature “horsepower” (i.e., the car’s HP number) to predict its fuel consumption (the “mpg” feature).

You can follow along on the associated notebook.  This post is part of a series about Machine Learning with Python.

```import pandas as pd

autoData = pd.read_csv('datasets/Auto.csv', na_values='?')

X = autoData[['horsepower']]
m = X.shape[0] # number of examples (=392)

Y = autoData.mpg.copy() # copy “y” column values out
```

We can see the data distribution via a scatter chart:

```import matplotlib.pyplot as plt

fig = plt.scatter(X, Y)
fig = plt.grid(True)
fig = plt.xlabel('Horsepower')
fig = plt.ylabel('Miles per Gallon (mpg)')
```

## Fit a linear regression model

First, we fit a simple linear model:

```from sklearn import linear_model
#
# Create the linear regression model
#
model = linear_model.LinearRegression()

model.fit(X, Y)

print (model.intercept_)
print(model.coef_)
```
``````39.93586102117046
array([-0.15784473])``````

The model looks like mpg = 39.9 – 0.15*horsepower.

And plotting the line:

We can already see that a simple linear regression model will not be very accurate … probably a curve would better approximate the real fuel consumption than a line.
Let’s measure the accuracy using the test-set hold-out approach.

# Test-set approach

Here we randomly divide the available set of samples into two parts: a training set and a test or hold-out set.
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the test set.
The resulting test-set error provides an estimate of the general error.

In scikit-learn a random split into training and test sets can be quickly computed with the `train_test_split` helper function.

```from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
```

I used 20% of the original dataset as a test set.

## Measure the model accuracy

We fit the model on the train dataset and measure the accuracy on the test dataset.
Note: the score() function is a shortcut to run predict() and then calculate the accuracy.

```model.fit(x_train, y_train)

print('Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))
```
`Test Accuracy: 0.587`

As expected, the accuracy is not very high, around 59%.

## Drawbacks of test set approach

The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the test set. Every time you split the data into different train/test datasets you get different models and scores. You can try it yourself by removing the fixed random seed above.

This suggests that the test set error may tend to overestimate the error for the model fit on the entire data set. Because only a subset of the observations — those that are included in the training set — are used to fit the model and, in general, the more data the lower the error.

Another drawback is that, in typical machine learning applications, we are also interested in tuning and comparing different parameter settings (model selection via hyper-parameters). However, if we reuse the same test set over and over again during model selection, it will become part of our training data and thus the model will be more likely to overfit because the parameters can be tweaked until the estimator performs optimally. In this way, knowledge about the test set can “leak” into the model and evaluation metrics such as the test error no longer report on generalisation performance.

A better way of using the holdout method for model selection is to separate the data into three parts: a training set, a validation set and a test set. The training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

But, by partitioning the available data into three sets, we drastically reduce the number of observations which can be used for learning the model and the results can depend on a particular random choice for the pair of (train, validation) sets.

# Cross-validation approach

A solution to this problem is a widely used procedure called cross-validation (CV for short).
A test set could still be held out for final evaluation but the validation set is no longer needed when doing CV.
In the basic approach, called k-fold CV, the training set is randomly divided into k roughly equal-sized (beside possibly the last one) smaller parts, called folds.
Then we leave out one of the folds and a model is trained using the other k-1 folds (combined). The resulting model is validated on the left out k-th fold.
This is done in turn for each fold  1, 2, … K.

The performance measure reported by k-fold cross-validation is then the average of the values computed for each fold.
This approach can be computationally expensive but does not waste too much data (as it is the case when fixing an arbitrary test set), which is a major advantage in problems where the number of observations is very small.

In sklearn this can be done using the KFold function from the package model_selection:

```from sklearn.model_selection import KFold
import numpy as np
# kf are the folds:
kf = KFold(n_splits = 10).split(X, Y)

scores = []
for k, (train_index, test_index) in enumerate(kf):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
model.fit(X_train, Y_train)
scoreK = model.score(X_train, Y_train)
scores.append(scoreK)

print('Fold: {:2d}, Acc: {:.3f}'.format(k+1, scoreK))

print('\nCV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores),
np.std(scores)))
```
``` Fold: 1, Acc: 0.595
Fold: 2, Acc: 0.608
Fold: 3, Acc: 0.613
Fold: 4, Acc: 0.614
Fold: 5, Acc: 0.604
Fold: 6, Acc: 0.605
Fold: 7, Acc: 0.602
Fold: 8, Acc: 0.610
Fold: 9, Acc: 0.610

CV accuracy: 0.607 +/- 0.006```

Now the accuracy of 0.607 is a much more reliable measure than the one from the test-set approach, which could change depending on which test set has been randomly selected.

## Leave One Out (LOO) cross-validation

Setting K = n (the number of observations) yields n-fold and is called leave-one out cross-validation (LOO), a special case of the K-fold approach.

LOO CV is sometimes useful but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance.
This is why the usual choice is K=5 or 10. It provides a good compromise for the bias-variance tradeoff.
We can see anyway the LOO approach too, using the helper sklearn function LeaveOneOut():

```from sklearn.model_selection import LeaveOneOut

LOO = LeaveOneOut().split(X, Y)

scores = []
for k, (train_index, test_index) in enumerate(LOO):

X_train, X_test = X.iloc[train_index], X.iloc[test_index]
Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

model.fit(X_train, Y_train)
scoreK = model.score(X_train, Y_train)
scores.append(scoreK)

print('\nCV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores),
np.std(scores)))
```
`CV accuracy: 0.606 +/- 0.001`

## A generic cross-validation score evaluating function.

The sklearn helper cross_val_score can quickly estimate the score of a model, using one CV approach:

```from sklearn.model_selection import cross_val_score
# Passing the entirety of X and y, not X_train or y_train
scores = cross_val_score(model, X, Y, cv=KFold(n_splits=5, shuffle=True,
random_state=1))
print ("Cross-validated scores:", scores)
print('CV accuracy: {:.3f}  +/- {:.3f} '.format (np.mean(scores), np.std(scores)))
```
```Cross-validated scores: [ 0.58656677 0.65131449 0.56557775 0.54515335 0.66048729]
CV accuracy: 0.602 +/- 0.046```

In the first iteration, the accuracy is 58%, in the second iteration the accuracy is 65% and so on.

## A better model with polynomial features

Now we try to improve our simple model’s accuracy by adding polynomial terms to our feature, i.e. transforming it into its power.
First we fit a model : mpg = beta0 + beta1 * horsepower^2

```from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2) # quadratic function
# we first transform our data
x_train2 = poly.fit_transform(x_train)
x_test2 = poly.fit_transform(x_test)
# then we fit the model
model2 = linear_model.LinearRegression()
model2.fit(x_train2, y_train)

model2.score(x_test2, y_test)
```
`0.69134645952506046`

It looks better, so we can try with a cubic function:

```poly = PolynomialFeatures(degree=3) # cubic function
x_train3 = poly.fit_transform(x_train)
x_test3 = poly.fit_transform(x_test)

model3 = linear_model.LinearRegression()
model3.fit(x_train3, y_train)

model3.score(x_test3, y_test)
```
`0.69099400625639762`

These results show that a model which predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower and there is little evidence in favour of a model that uses a cubic function of horsepower.

# Using CV to tune parameters

We can repeat this procedure for increasingly complex polynomial fits.
To automate the process, we use a loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 15, computes the associated cross-validation score and stores it.

```# search for an optimal value of poly n for LR

# range of n we want to try
n_range = range(1, 15)
# empty list to store scores
n_scores = []
model_n = linear_model.LinearRegression()

# 1. we will loop through values of n
for n in n_range:
# 2. transform features with n degree
poly = PolynomialFeatures(degree=n)
x_val_n = poly.fit_transform(X)

# 3. obtain cross_val_score for model_n
scores = cross_val_score(model_n, x_val_n, Y, cv=KFold(n_splits=5,
shuffle=True,
random_state=1))
# 4. append mean of scores to n_scores list
n_scores.append(scores.mean())

#print ("Cross-validated scores:", scores)

print('CV accuracy: {:.3f} +/- {:.3f} '.format(np.mean(n_scores), np.std(n_scores)))
```
`CV accuracy: 0.641 +/- 0.075`
```# plot how accuracy changes as we vary the polynomial degree

plt.plot(n_range, n_scores)
plt.xlabel('Degree for horsepower')
plt.ylabel('Cross-validated accuracy')
```

This is quite a typical shape of the curve when examining the model complexity and accuracy, first increasing then reaching the maximum CV accuracy (in this case, it occurs from degree=2 to 12 and then quickly decreasing (overfit).
This is an example of bias-variance trade off.

We see a sharp improvement in the estimated accuracy between the linear and quadratic fits, but then no clear improvement from using higher-order polynomials.
Therefore, the best model is the one with the quadratic function.