Multiple Linear Regression

In the Introduction to Linear Regression I have shown how is possible to use one single feature input – like the average season temperature – to predict another variable – like the wine price. Using an approach called Linear Regression.

But often we have many different features which we could use as inputs. As we have seen in the introduction example, other features – such as the amount of rain in the season – were available.
You might go in to the historical data set and see that two years had very similar average temperature but one was much more rainy than the other. Has this an effect on the wine quality and therefore on its price?  And if yes, which kind of effect, positive or negative? Small or big?

In this section we want to use multiple features (“the independent variables”) to predict the wine price (“the dependent variable”). In particular, in this higher dimensional space, we want to fit some kind of function that models the relationship between these inputs and the output.

Single-variable linear regression: a recall

Remember from the introduction that the basic regression model is the single variable regression model, where the independent variable – x – is only one and the dependent variable – y – is given by:

$y = \beta_{0} + \beta_{1}\cdot x$

Beta0 and Beta1 are specially selected coefficients:

• Beta0 = intercept of the line
• Beta1 = slope of the line

The linear regression model with the Average Temperature (AGST) as single feature looks then like this when we calculate it using the Gradient Descent method:

import numpy as np

# wine is the data frame with all the data
print("*** LR model with AGST")
features = wine[['AGST']] # input
prices = wine.Price # output

# append a ones column to the front of the data set
features.insert(0, 'intercept', 1)

# convert from data frames to numpy matrices
X = np.array(features.values)
y = np.array(prices.values)

theta = np.array([0.0, 0.0])
alpha = 1e-4
epsilon = 1e-5

# perform gradient descent to "fit" the model parameters
gdBetas, cost_history = gradientDescent(X, y, theta, alpha, epsilon)
print ("With AGST the weights are: ", gdBetas)

That returns:

*** LR with AGST
** converged after iterations: 210
With AGST the weights are:  [-3.02345593   0.61235899]

Which means that the LR formula with AGST as input is:

Price = -3.02 + 0.61 * AGST

Let’s calculate now the Residuals (the errors between our predictions and the actual values which we know).

We know that the SSE (Sum of Squared Errors) for the baseline (wine price = average price) is around 10.15

SSE baseline =  10.1506377256

# check the metrics for Price = -3.02 + 0.61*AGST
# calculate LR output based on weights (beta) and features (inputs)
def predict(beta, inputs):
return inputs.dot(beta)

# return loss metrics based on actual and expected values
def evaluateModel(predictions, actual):
# Sum of Squared Errors
SSE = np.sum(np.square(actual - predictions))
# Mean Squared Error
MSE = SSE / len(actual)
# Root Mean Squared Error
RMSE = sqrt(MSE)

return SSE, MSE, RMSE

# check the loss metrics

predictions = predict(gdBetas, features)
SSE, MSE, RMSE = evaluateModel(predictions, prices)

print ("RMSE using AGST = ", RMSE)
print ("R^2 using AGST = ", 1 - SSE / SSE_baseline)

RMSE using AGST = 0.4795631920221002
R^2 using AGST  = 0.4335802799798185

Recall that R2 is a number between 0 and 1; being 1 when it is a perfect prediction (no errors).

Multi-variable linear regression

Many different independent variables could be used to predict the wine price:

•  Average Growing Season Temperature (AGST)
•  Harvest Rain
•  Winter Rain
•  Age of Wine (in 1990)
•  Population of France

We know that the Price is correlated stronger with the Average Temperature, less with Winter Rain and Harvest Rain features:

In [1]: wine.corr()  # note: use only a subset of wine dataframe
Out[1]:
Year      Price     WinterRain AGST      HarvestRain
Year         1.000000 -0.447768  0.016970  -0.246916   0.028009
Price       -0.447768  1.000000  0.136651   0.659563  -0.563322
WinterRain   0.016970  0.136651  1.000000  -0.321091  -0.275441
AGST        -0.246916  0.659563 -0.321091   1.000000  -0.064496
HarvestRain  0.028009 -0.563322 -0.275441  -0.064496   1.000000

But what happens using multiple variables?

Multiple linear regression allows us to use some or all of these variables to improve our predictive ability. A model with k variables looks like:

$y = \beta_{0} + \beta_{1}\cdot x_{1} + \beta_{2}\cdot x_{2} + ... + \beta_{k}\cdot x_{k}$

One simple model with two features (e.g., the average temperature and the harvest rain) could be just this function as beta0 plus beta1 times the average temperature plus beta2 times the harvest rain:

$price = \beta_{0} + \beta_{1}\cdot AGST + \beta_{2}\cdot harvest rain$

The average temperature and the harvest rain are the known variables (input or features) and the beta parameters are selected to minimise the errors (SSE), exactly as for the single variable model.

One difficulty of adding more variables is that it become complex to visualise the scatter plots as we could do above for the single variable AGST.
A scatter plot with TWO input variables is a 3-D plot and the function price would be a plan instead of a line:

fig = plt.figure()

ax.scatter(wine.AGST, wine.HarvestRain, wine.Price, c='b', marker='o')

ax.set_xlabel('Temperature ')
ax.set_ylabel('Harvest Rain')
ax.set_zlabel('Log of Price')

plt.show()

Adding even more variables means increasing the number of dimensions and the function becoming a hyperplane. Definitely complex to visualise!

Let´s see now how this model works.

The good news is that the gradient descent descent algorithm outlined for the single variable is working with no further changes for multiple variables, thanks to its inputs being vectorised.

print ("*** Now adding HarvestRain as input")

features = wine[['AGST', 'HarvestRain']] # TWO features as input
# append a ones column to the front of the data set
features.insert(0, 'intercept', 1)
X = np.array(features.values)

theta = np.array([0.0, 0.0, 0.0]) # THREE beta parameters

gdBetas, cost_history = gradientDescent(X, y, theta, alpha, epsilon)
print ("With AGST + HarvestRain the weights are:  ", gdBetas)
predictions = predict(gdBetas, features)
SSE, MSE, RMSE = evaluateModel(predictions, prices)

print ("RMSE using AGST + harvest_rain = ", RMSE)
print ("R^2 using AGST + harvest_rain  = ", 1 - SSE / SSE_baseline)

And now the results are:

*** Now adding HarvestRain
** converged after iterations: 4
With AGST + HarvestRain the weights are:
[-2.00000341   0.59994362   -0.00534501]
RMSE using AGST + harvest_rain = 0.3519669642545581
R^2 using AGST + harvest_rain  = 0.694894184790624

Using in addition the harvest rain feature, the error decreased and the R2 metric jumped to 0.69 !
Definitely it helps to consider two features instead of one.
Let’s see if using also the winter rain feature can further improve the
model.

print("*** Now adding WinterRain ")

features = wine[['AGST', 'HarvestRain', 'WinterRain']] # now 3 features
# append a ones column to the front of the data set
features.insert(0, 'intercept', 1)
X = np.array(features.values)

theta = np.array([0.0, 0.0, 0.0, 0.0]) # now 4 beta parameters
gdBetas, cost_history = gradientDescent(X, y, theta, alpha, epsilon)

The results with three features are:

*** Now adding WinterRain
** converged after iterations: 5
With AGST + HarvestRain + WinterRain the weights are:
[ -4.30000052   6.99991322e-01   -4.07910197e-03   7.23013020e-04]
RMSE using AGST,harvest and winter rain = 0.32299337051627536
R^2 using AGST,harvest and winter rain  = 0.7430587116355358

Yes, the error decreased more and the R2 is better but only slightly improved.

This is the downside of adding more and more features: the improvements diminish and you risk to “overfit” the model, making it too complex and too adherent to the initial data set but away from unseen data (more on this in a next post).

Interpreting the coefficients

Just like we did in simple linear regression, we can think about interpreting
the beta coefficients of our estimated function:

Price = Beta0 + Beta1 * AGST

What we said was that Beta1 was the predicted change in our output per unit change in our input. So, predicted change in the wine price for one degree change and we have seen that each grade more increased the wine value of 0.6

How do we interpret the coefficients if we have two inputs, instead of just one?

Price = Beta0 + Beta1 * AGST + Beta2 * HarvestRain

In this case we can fix one input, the average temperature. And then think about what’s the effect of the amount of rain.
Doing that is equivalent to taking a slice (red in the picture below) through the function space (blue in the picture) for however many temperature degrees we’re looking at.

When you slice the hyperplane living here in 3D space you just get a line. A line for
fixed number of temperature degrees.
The interpretation of Beta2 – the coefficient on the amount of rain – is the slope of this line, so that means for a single unit change in our input, which here is millimeter of rain.

We fix a temperature for a given harvest season (let’s say 16.3 degrees) and we can see that – for this given temperature – adding one millimeter of rain will increase the value of my wine of Beta2. That’s the interpretation of Beta2.

The same for the other inputs. I can fix the rain parameter and Beta1 will give the change in wine value for any unit change for temperature, given that fixed amount of rain. And so on.

Variables pre-requisites

Not all available variables should be used: beside the overfitting problem, each new variable requires more data. We will see later how to appropriately choose variables.

One obvious criteria is that the variables should be independent, not correlated between them. If I use two variables which are correlated between them, I do not add value to the model.
In the wine data set two variables highly correlated (more than 0.7) are the France Population and the Year (i.e., the population increased regularly year after year) and in fact their correlation is 0.99
Clearly using both of them will not add anything to the model.

Using all features beside the highly correlated one, makes our wine model to have a value of R2 = 0.83

This tells us that a linear regression model with only a few variables can predict wine prices well. In many cases, it outperforms wine experts’ opinions.
A classic case of quantitative approach to a traditionally qualitative problem.