We have seen how to fit a model to existing data using linear regression. Now we want to assess how well the model describes those data points (every Outcome = Model + Error), and we will use some statistics to do so.
The following code is also available as a Notebook on GitHub.
As an example we access some available diamond ring data (from the Journal of Statistics Education): prices in Singapore dollars and weights in carats (the standard measure of diamond mass, equal to 0.2 g).
import pandas as pd

diamondData = pd.read_csv("diamond.dat.txt", delim_whitespace=True,
                          header=None, names=["carats", "price"])

In [1]: diamondData.head()
Out[1]:
   carats  price
0    0.17    355
1    0.16    328
2    0.17    350
3    0.18    325
4    0.25    642
Fit a model
Is there a relationship between the diamond price and its weight?
Our first goal should be to determine whether the data provide evidence of an association between price and carats. If the evidence is weak, then one might argue that bigger diamonds are not better!
To evaluate the model we will use a dedicated Python package, statsmodels. It is based on the original statistics module of SciPy (Scientific Python) by Jonathan Taylor, which was later removed from SciPy and then corrected, improved, tested and released as a new package during the Google Summer of Code 2009.
Since statsmodels also offers functions to fit a linear regression model, we do not need to import sklearn: we can do everything with statsmodels.
We will use its function OLS() that fits a linear regression based on the Ordinary Least Squares algorithm.
The model we want to get is :
y_hat = beta0 + beta1 * x

where y_hat is the estimated diamond price (the dependent variable) and x is the diamond weight (the independent variable).
An intercept is not included by default and should be added by the user:
import statsmodels.api as sm

X = sm.add_constant(diamondData.carats)  # this appends a column of ones
simpleModel = sm.OLS(diamondData.price, X).fit()  # fit the model
In [2]: simpleModel.params  # get the coefficients
Out[2]:
const     -259.625907
carats    3721.024852
The intercept (Beta0) is -259.6 and the slope for carats (Beta1) is 3721.
Therefore the resulting model is:
price = -259.6 + 3721 * carats
We can plot the obtained regression line together with the input data X.
We can do it by drawing a line using the beta parameters just calculated or also plotting the fitted values:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(diamondData.carats, diamondData.price)
# draw the linear regression line through the fitted values
y_hat = simpleModel.fittedvalues
ax.plot(diamondData.carats, y_hat)
# pretty-up the plot
fig.suptitle("Relation between diamonds' price and weight")
ax.set_ylabel('Price [SIN $]')
ax.set_xlabel('Weight [carat]')
ax.grid(True)
This answers our first question.
There is a relationship between price and weight of diamond and we can model it.
Which kind of relation is that?
The next question could be to find out whether the relation is linear and what it looks like.
Coefficients interpretation
Beta1 is the expected change in response for a 1 unit change in the predictor.
In our case, we expect a 3721 Singapore dollar increase in price for every one-carat increase in the mass of the diamond.
This holds only within the restricted range considered; extrapolating the regression line to bigger diamond stones would not be advisable, as those stones are rarer and command a different price range.
Beta0 is the expected price when the weight is zero.
This does not always make sense, and in our case the negative intercept is even more puzzling because it suggests that a zero-carat diamond ring has a negative economic value!
Getting a more interpretable intercept
The intercept -259.63 is the expected price of a 0 carat diamond.
This does not make much sense (unless you consider it the cost of bothering the diamond expert when you ask the price of a non-existent diamond 🙂 ).
It can be ignored (the model applies only to a restricted range of the data) or can be an indication that a different model could be more appropriate, for example a nonlinear regression.
We can also try to find the expected price for a more suitable weight, for example the average diamond weight:
In [3]: diamondData.carats.mean()
Out[3]: 0.2041666666666667

In [4]: XmeanCentered = diamondData.carats - diamondData.carats.mean()
In [5]: XmeanCentered = sm.add_constant(XmeanCentered)
In [6]: meanCenteredModel = sm.OLS(diamondData.price, XmeanCentered).fit()
In [7]: meanCenteredModel.params
Out[7]:
const      500.083333
carats    3721.024852
Thus $500 is the expected price for the average-sized diamond of the sample (0.2042 carats for our dataset). This makes more sense.
As you can see, the slope is the same as the previous model, only the intercept shifted.
This is always valid when you shift your X values.
You can shift the X input by a certain value but you can also rescale them.
This can be useful when one unit is quite large and we would prefer a finer scale.
For example, in our case one carat is worth almost 4000 SIN$, so it could make sense to talk about tenths of a carat.
In [8]: Xtenth = diamondData.carats * 10  # rescale the X
In [9]: Xtenth = sm.add_constant(Xtenth)
In [10]: tenthModel = sm.OLS(diamondData.price, Xtenth).fit()
In [11]: tenthModel.params
Out[11]:
const     -259.625907
carats     372.102485
We expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in mass of diamond.
Note that this is equivalent to dividing the slope coefficient by 10.
The intercept is the same as in the original model, only the slope coefficient is divided by 10.
This is always valid when you rescale the X values.
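Both invariances (shifting the predictor changes only the intercept; rescaling it changes only the slope) can be verified numerically. Below is a minimal sketch using a hand-rolled OLS fit and a small hypothetical stand-in dataset, not the statsmodels API:

```python
import numpy as np

# hypothetical stand-in data (weights in carats, prices), not the diamond file
x = np.array([0.17, 0.16, 0.17, 0.18, 0.25, 0.21, 0.30])
y = np.array([355.0, 328.0, 350.0, 325.0, 642.0, 483.0, 750.0])

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

b0, b1 = ols_fit(x, y)
b0_c, b1_c = ols_fit(x - x.mean(), y)   # shift: mean-centre the predictor
b0_s, b1_s = ols_fit(x * 10, y)         # rescale: carats -> tenths of a carat

assert np.isclose(b1, b1_c)                   # shifting leaves the slope unchanged
assert np.isclose(b0_c, b0 + b1 * x.mean())   # intercept moves to y_hat at mean(x)
assert np.isclose(b0, b0_s)                   # rescaling leaves the intercept unchanged
assert np.isclose(b1_s, b1 / 10)              # slope is divided by the scale factor
```

The same algebra explains both results: substituting x = (x - c) + c into y_hat = b0 + b1*x moves b1*c into the intercept, while substituting x = x'/k divides the slope by k.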
Predicting the price of a diamond
Once we have a model of the relation, we can use it for predictions.
The statsmodels package has a method predict() associated with each model: it takes a new set of inputs and outputs the predicted values according to the model.
Let’s say that I want to buy a 0.2 carats diamond. How much should I expect it to cost?
I can take the beta parameters estimated by the model and just plug them into the linear regression formula:
In [12]: simpleModel.params[0] + 0.2*simpleModel.params[1]
Out[12]: 484.57906311853975
I expect to pay around 485 SIN $.
Or I can use the predict() function available in the statsmodels package:
In [13]: newDiamond = [1, 0.2]  # remember to always add the intercept!
In [14]: simpleModel.predict(newDiamond)
Out[14]: array([ 484.57906312])
It’s also possible to pass a list of values to predict:
In [15]: newDiamonds = sm.add_constant([0.16, 0.27, 0.34])
In [16]: simpleModel.predict(newDiamonds)
Out[16]: array([ 335.73806906, 745.05080273, 1005.52254234])
Result: for 0.16, 0.27 and 0.34 carats, we predict the prices to be respectively 335.74, 745.05 and 1005.52 Singapore dollars.
How strong is the relationship?
We know that there is a relationship between diamonds carats and prices, we would like to know the strength of this relationship. In other words, given a certain diamond weight, can we predict the price with a high level of accuracy? This would be a strong relationship. Or is a prediction of prices based on weight only slightly better than a random guess? This would be a weak relationship.
Residuals
As we have seen, the residuals are the difference between the observed (y) and the predicted outcome (y_hat):
In [17]: y_hat = simpleModel.fittedvalues
In [18]: y = diamondData.price
In [19]: max(abs(y - y_hat))
Out[19]: 85.158566087530232
Conveniently, the residuals are also stored in the results attribute resid:
In [20]: residuals = simpleModel.resid
In [21]: max(abs(residuals))
Out[21]: 85.158566087530232
85 SIN$ (above or below the observed price) is the biggest error made by the model.
Don’t confuse errors and residuals.
The error is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean), and the residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
We can learn many things from the residuals.
One is that their distribution and properties can give an indication about the model fit.
Residuals should not show any pattern
The residuals and their plot can highlight a poor model fit.
Here we plot the residuals versus the fitted values:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(simpleModel.fittedvalues, residuals, 'o')  # as round marks
ax.plot((0, 1200), (0, 0))  # draw also a line at y=0
# pretty-up the plot
fig.suptitle("Residuals versus fitted prices")
ax.set_ylabel('Residuals [SIN $]')
ax.set_xlabel('Price [SIN $]')
ax.grid(True)
The residuals should be scattered randomly, without showing any pattern, and with a constant variance.
If we see from the residuals vs. fitted plot that the variance of the residuals increases as the fitted values increase (taking the form of a horizontal cone), this is a sign of heteroscedasticity.
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable.
Examining the scatterplot of the residuals against the predicted values of the dependent variable would show the classic cone-shaped pattern of heteroscedasticity.
Other patterns could be:
- curvilinear (indicates a non-linear relation or a missing higher-order term)
- a single point far away from zero (probably an outlier)
- a single point far away from the others in the x-direction (probably an influential point)
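Beyond eyeballing the plot, a crude numeric indicator (a sketch only; statsmodels also provides formal tests such as Breusch-Pagan) is the correlation between the absolute residuals and the fitted values: near zero for homoscedastic data, clearly positive when the spread grows into a cone. Synthetic data and arbitrary stand-in coefficients are used here:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 0.4, 1000)

# homoscedastic: constant noise level around a hypothetical linear trend
y_homo = 260 + 3700 * x + rng.normal(0, 30, x.size)
# heteroscedastic: noise grows with x, producing the cone-shaped residual plot
y_hetero = 260 + 3700 * x + rng.normal(0, 300 * x, x.size)

def abs_resid_corr(x, y):
    """Correlation between |residuals| and fitted values after a simple OLS fit."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    fitted = (y.mean() - slope * x.mean()) + slope * x
    return np.corrcoef(fitted, np.abs(y - fitted))[0, 1]

print(abs_resid_corr(x, y_homo))    # near zero: the spread does not change
print(abs_resid_corr(x, y_hetero))  # clearly positive: spread grows with the fit
```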
Residuals should be normally distributed
The sum of the residuals is expected to be zero (when there is an intercept).
This follows directly from the normal equation, i.e. the equation that the OLS estimator solves.
In [22]: sum(residuals)
Out[22]: 7.3896444519050419e-13
The mean of the residuals is expected to be zero.
This comes directly from the fact that OLS minimises the sum of squared residuals.
In [23]: import numpy as np
In [24]: np.mean(residuals)
Out[24]: 1.5395092608135503e-14
This is one of the assumptions for regression analysis: residuals should have a normal (or Gaussian) distribution.
In [25]: plt.hist(residuals)
It looks normal but we can verify better with a QQ plot.
QQ plot to verify the residuals distribution
QQ plots (short for "quantile-quantile plots") can be used to check whether the data are normally distributed or not.
It is a plot where the axes are purposely transformed in order to make a normal (or Gaussian) distribution appear in a straight line. In other words, a perfectly normal distribution would exactly follow a line with slope = 1 and intercept = 0.
Therefore, if the plot does not appear to be – roughly – a straight line, then the underlying distribution is not normal. If it bends up, then there are more “high flyer” values than expected, for instance.
The theoretical quantiles are placed along the x-axis. That is, the x-axis is not your data; it is simply an expectation of where your data should have been if they were normal.
The actual data are plotted along the y-axis.
The values are the standard deviations from the mean. So 0 is the mean of the data, 1 is one standard deviation above, etc. This means, for instance, that 68.27% of all your data should be between -1 and 1 if you have a normal distribution.
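Before using the library function, a QQ plot can be sketched by hand to make the idea concrete: sort the sample, compute where each order statistic should fall under a normal distribution, and compare the two. The snippet below is illustrative only and uses synthetic data:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
sample = np.sort(rng.normal(0, 1, 500))  # synthetic, normally distributed data

# theoretical quantiles: where each sorted value "should" be if the data were normal
probs = (np.arange(1, sample.size + 1) - 0.5) / sample.size
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

# for normal data the points hug the straight line, so the correlation
# between theoretical and sample quantiles is very close to 1
r = np.corrcoef(theoretical, sample)[0, 1]
print(round(r, 3))
```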
statsmodels offers a handy qqplot() function:
In [26]: sm.qqplot(residuals, fit=True, line='45')
Estimating residual variation
The residual variation measures how well the regression line fits the data points.
It is the variation in the dependent variable (Price) that is not explained by the regression model and is represented by the residuals. We want the residual variation to be as small as possible.
Each residual is distributed normally with mean 0 and variance = sigma_squared.
We have previously seen that the ML estimate of the variance, sigma_squared, is the sum of the squared residuals divided by n, and we called it the Mean Squared Error (MSE).
Most people use (n-2) instead of n so that the estimator is unbiased (the 2 accounts for the degrees of freedom used by the intercept and slope).
The square root of the estimate, sigma, is called the Root Mean Squared Error (RMSE). We want both MSE and RMSE to be as small as possible.
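The bias of the divide-by-n estimator can be seen in a small simulation. This is a sketch with synthetic data; the coefficients and noise level are arbitrary stand-ins, not the diamond fit:

```python
import numpy as np

rng = np.random.default_rng(2)
true_sigma = 30.0   # noise level used to simulate the data
n = 10              # small sample, where the bias is visible
x = np.linspace(0.1, 0.4, n)

# average each variance estimate over many simulated regressions
mse_biased, mse_unbiased = [], []
for _ in range(5000):
    y = 260 + 3700 * x + rng.normal(0, true_sigma, n)
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    resid = y - (y.mean() - slope * x.mean()) - slope * x
    mse_biased.append(np.sum(resid**2) / n)          # ML estimate
    mse_unbiased.append(np.sum(resid**2) / (n - 2))  # degrees-of-freedom corrected

print(np.mean(mse_biased))    # systematically below true_sigma**2 = 900
print(np.mean(mse_unbiased))  # close to 900
```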
In our diamonds example the estimated residual variation (unbiased RMSE) is :
In [27]: n = len(y)
In [28]: MSE = sum(residuals**2) / (n-2)
In [29]: RMSE = np.sqrt(MSE)
In [30]: RMSE
Out[30]: 31.840522265031755
RMSE can be used to calculate the standardised residuals too.
They equal the value of a residual divided by an estimate of its standard deviation (here, the RMSE).
Large standardised residuals are an indication of an outlier.
In [31]: max(simpleModel.resid / RMSE)
Out[31]: 2.4927258932276684
Summarizing the variation: R-squared
The total variation is the residual variation (variation after removing predictors) plus the systematic variation (variation explained by regression model).
R-squared is the percentage of variability explained by the regression model:

R-squared = explained variation / total variation = 1 - residual variation / total variation

R-squared is always between 0 and 1 (0% and 100%):
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
In [32]: simpleModel.rsquared
Out[32]: 0.97826077798603295
We are quite close to a perfect model.
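The R-squared arithmetic can also be reproduced by hand. A minimal sketch on a hypothetical stand-in dataset (not the diamond file), which additionally checks the well-known identity that in simple linear regression R-squared equals the squared correlation coefficient:

```python
import numpy as np

# hypothetical stand-in data (weights and prices)
x = np.array([0.17, 0.16, 0.17, 0.18, 0.25, 0.21, 0.30])
y = np.array([355.0, 328.0, 350.0, 325.0, 642.0, 483.0, 750.0])

# ordinary least squares by hand
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x

ss_res = np.sum((y - fitted) ** 2)      # residual variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
r_squared = 1 - ss_res / ss_tot

# in simple linear regression R-squared equals the squared correlation
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
```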
You can use a fitted-line plot to illustrate different R-squared values graphically.
The more variation that is explained by the model, the closer the data points fall to the line. Theoretically, if a model could explain 100% of the variation, the fitted values would always equal the observed values and all of the data points would fall on the line.
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(simpleModel.fittedvalues, diamondData.price, 'o')  # as round marks
# identity line
plt.plot(simpleModel.fittedvalues, simpleModel.fittedvalues, '-')
# pretty-up the plot
fig.suptitle("Relation between estimated and actual diamonds' prices")
ax.set_xlabel('Estimated Price [SIN $]')
ax.set_ylabel('Actual Price [SIN $]')
ax.grid(True)
R-squared can be a misleading summary and needs to be interpreted carefully (deleting data can inflate R-squared, for example).
Because of this, the adjusted R-squared is sometimes preferred: it is R-squared adjusted for the number of observations.
There are several formulas that can be used; normally it is Wherry's formula:
In [33]: 1 - (1-simpleModel.rsquared)*((n-1)/simpleModel.df_resid)
Out[33]: 0.97778818620312058
Of course, it is also available from the model results:
In [34]: simpleModel.rsquared_adj
Out[34]: 0.97778818620312058
In conclusion (residuals distribution, variation), the model is pretty good and the relation is very strong.
Confidence
How accurately can we predict the diamond prices?
For any given weight in carats, what is our prediction for the price, and what is the accuracy of this prediction?
In statistics, a sequence of random variables is independent and identically distributed (IID) if each random variable has the same probability distribution as the others and all are mutually independent.
Inference for regression
In the case of regression, under IID sampling assumptions and normally distributed residuals, the statistics for our estimated beta coefficients (each estimate divided by its standard error):
- follow a t-distribution in finite samples (approaching the normal distribution as the sample grows)
- can be used to test a null hypothesis
- can be used to create confidence intervals
In probability and statistics, the t-distribution is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.
Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population.
The t-distribution becomes closer to the normal (Gaussian) distribution as its degrees of freedom (df) increase.
The t-distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown parameter, such as a mean value, in a setting where the data are observed with additive errors.
If the population standard deviation of these errors is unknown and has to be estimated from the data, the t-distribution is often used to account for the extra uncertainty that results from this estimation.
Confidence intervals and hypothesis tests are two statistical procedures in which the quantiles of the sampling distribution of a particular statistic (e.g. the standard score) are required.
Confidence levels are expressed as a percentage (for example, a 95% confidence level). It means that if you took the sample over and over again, 95 percent of the computed intervals would contain the true population value.
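This repeated-sampling interpretation can be simulated. The sketch below uses a large-sample z-interval for a mean as a simplification (the t-interval machinery comes next) and synthetic data with an assumed true mean:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
z = NormalDist().inv_cdf(0.975)   # ~1.96, the two-sided 95% normal quantile

true_mean, covered, trials = 5.0, 0, 2000
for _ in range(trials):
    sample = rng.normal(true_mean, 2.0, 100)
    half = z * sample.std(ddof=1) / np.sqrt(sample.size)  # half-width of the CI
    if sample.mean() - half <= true_mean <= sample.mean() + half:
        covered += 1

print(covered / trials)  # close to 0.95
```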
When the population standard deviation sigma is not known, an interval estimate for the population mean with confidence level (1-alfa) is given by:

Xmean ± t * (estimated standard error of the mean)

where t is a critical value determined from the t-distribution in such a way that there is an area of (1-alfa) between -t and t.
First, we need to calculate the variance.
Estimating the coefficients and the variance
Recall that our linear regression model is:
Y = Beta0 + Beta1 * X + errors
We can define the beta parameters as:
beta0 = mean(Y) – beta1 * mean(X)
beta1 = Cor(Y,X) * Sd(Y)/Sd(X)
For our diamonds example:
# prepare the data
In [35]: y = diamondData.price
In [36]: x = diamondData.carats
In [37]: n = len(y)

# calculate beta1
In [38]: beta1 = (np.corrcoef(y, x) * np.std(y) / np.std(x))[0][1]
In [39]: beta1
Out[39]: 3721.0248515504722

# calculate beta0
In [40]: beta0 = np.mean(y) - beta1 * np.mean(x)
In [41]: beta0
Out[41]: -259.62590719155486
Sigma is unknown, but its estimate is the square root of the sum of the squared errors divided by n-2 (the degrees of freedom):
In [42]: e = y - beta0 - beta1 * x  # the residuals

# estimate of sigma (the residual standard error)
In [43]: sigma = np.sqrt(sum(e**2) / (n-2))
In [44]: sigma
Out[44]: 31.840522265031737
Confidence intervals
A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.
Recall of quantiles and percentiles of a distribution
If you were the 95th percentile on an exam, it means that 95% of people scored worse than you and 5% scored better.
These are sample quantiles.
Now for a population: the ith quantile of a distribution with distribution function F is simply the point x_i so that :
F(x_i) = i
A percentile is simply a quantile with i expressed as a percent.
The 95th percentile of a distribution is the point below which a random variable drawn from the population falls with probability 0.95.
Approximately 68%, 95% and 99% of the normal density lies respectively within 1, 2 and 3 standard deviations from the mean.
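These percentages can be verified with the standard library alone, since the standard normal probability P(|Z| <= k) can be written with the error function as erf(k / sqrt(2)). A quick sketch:

```python
import math

def normal_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

print(round(normal_within(1), 4))  # 0.6827
print(round(normal_within(2), 4))  # 0.9545
print(round(normal_within(3), 4))  # 0.9973
```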
Estimating the Standard Errors
Now we need to calculate the standard errors.
In [45]: ssx = sum((x  np.mean(x))**2)
# calculate standard error for beta0
In [46]: seBeta0 = (1 / n + np.mean(x) ** 2 / ssx) ** 0.5 * sigma
In [47]: seBeta0
Out[47]: 17.318856186448148
# calculate standard error for beta1
In [48]: seBeta1 = sigma / np.sqrt(ssx)
In [49]: seBeta1
Out[49]: 81.785880366042491
The standard error of the parameter measures the precision of the estimate of the parameter.
The smaller the standard error, the more precise the estimate.
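As a sketch of this point (synthetic data, arbitrary stand-in coefficients): the standard error of the slope shrinks as the sample size grows, because the denominator ssx accumulates more terms:

```python
import numpy as np

rng = np.random.default_rng(3)

def se_slope(n):
    """Standard error of the OLS slope for one simulated dataset of size n."""
    x = np.linspace(0.1, 0.4, n)
    y = 260 + 3700 * x + rng.normal(0, 30, n)   # arbitrary stand-in model
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    resid = y - (y.mean() - slope * x.mean()) - slope * x
    sigma = np.sqrt(np.sum(resid**2) / (n - 2))
    ssx = np.sum((x - x.mean()) ** 2)
    return sigma / np.sqrt(ssx)

# more data -> a smaller standard error -> a more precise slope estimate
print(se_slope(20), se_slope(2000))
```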
Hypothesis testing
Hypothesis testing is concerned with making decisions using data.
A null hypothesis is specified that represents the status quo, usually labeled H0.
The null hypothesis is assumed true and statistical evidence is required to reject it in favour of an alternative hypothesis.
Consider testing H0: mu = mu0
If we take the set of all values of mu0 for which we fail to reject H0 at level alfa, this set is a (1-alfa) confidence interval for mu.
Getting the Tvalues
We test the null hypothesis H0 that the true coefficients beta0 and beta1 are equal to zero.
Dividing each estimated parameter by its standard error gives a t-value:
In [50]: tBeta0 = beta0 / seBeta0
In [51]: tBeta0
Out[51]: -14.990938454394572

In [52]: tBeta1 = beta1 / seBeta1
In [53]: tBeta1
Out[53]: 45.497154703190581
Pvalues
Pvalues are the most common measure of “statistical significance”.
The Pvalue is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone.
If the pvalue is small, then either H0 is true and we have observed an extreme rare event or H0 is false.
Let’s say a Pvalue is 0.1: then the probability of seeing evidence as extreme or more extreme than what actually has been obtained under H0 is 0.1 (10%).
By reporting a P-value, any observer can perform their own hypothesis test at whatever alfa level they choose. If the P-value is less than alfa, then you reject the null hypothesis.
Estimating the P-values for the alternative hypothesis that the coefficients are not zero
We can use the t-distribution module from SciPy.stats to calculate the p-values:
from scipy.stats import t
The survival function of a random variable is the probability that the random variable is bigger than a value x.
The SciPy.stats function sf() returns this probability:
In [54]: degreesOfFreedom = simpleModel.df_resid  # the residual degrees of freedom
In [55]: pBeta0 = t.sf(abs(tBeta0), df=degreesOfFreedom) * 2  # two-sided
In [56]: pBeta1 = t.sf(abs(tBeta1), df=degreesOfFreedom) * 2
Let’s summarise the values calculated until now:
print("##          Estimate        Std. Error     t-value         p-value")
print("Intercept: ", beta0, seBeta0, tBeta0, pBeta0)
print("Carats:    ", beta1, seBeta1, tBeta1, pBeta1)

##          Estimate        Std. Error     t-value         p-value
Intercept:  -259.625907192  17.3188561864  -14.9909384544  2.52327062853e-19
Carats:     3721.02485155   81.785880366   45.4971547032   6.75125983495e-40
If the p-value is less than the chosen significance level (0.05 in our case), we reject the null hypothesis: the predictor helps explain the variation in the response.
T confidence intervals
For small samples we can use the tdistribution to calculate the confidence intervals.
The t-distribution was introduced by William Gosset in 1908 and is indexed by its degrees of freedom (df): it becomes more like a standard normal as df gets larger.
ppf() is the Percent Point Function from SciPy.stats; it takes as input the quantile (0.975 for a two-sided 95% probability) and the degrees of freedom.
In [57]: alpha = 0.05  # confidence level for a two-sided hypothesis
In [58]: qt = 1 - (alpha/2)  # = 0.975 for a two-sided 95% probability
In [59]: t_value = t.ppf(qt, df=degreesOfFreedom)
In [60]: t_value
Out[60]: 2.0128955952945886
Now we can calculate the intervals for beta0 and beta1:
In [61]: limits = [-1, 1]
In [62]: [beta0 + i*t_value*seBeta0 for i in limits]
Out[62]: [-294.48695652479677, -224.76485785831295]

In [63]: [beta1 + i*t_value*seBeta1 for i in limits]
Out[63]: [3556.3984132043752, 3885.6512898965693]
Interpretation: with 95% confidence, we estimate that a 1 carat increase in diamond size results in a 3556 to 3886 Singapore dollar increase in price.
Plot the confidence interval
We calculate the interval for each x value; we will use the isf() function to get the inverse survival function:
predicted = simpleModel.fittedvalues
x_1 = simpleModel.model.exog  # just the x values plus a column of ones

# get the standard deviation of the predicted values
predvar = simpleModel.mse_resid + (x_1 * np.dot(simpleModel.cov_params(), x_1.T).T).sum(1)
predstd = np.sqrt(predvar)

tppf = t.isf(alpha/2.0, simpleModel.df_resid)
interval_u = predicted + tppf * predstd
interval_l = predicted - tppf * predstd

fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="data")
ax.plot(x, simpleModel.fittedvalues, 'g', label="OLS")
ax.plot(x, interval_u, 'c', label="Intervals")
ax.plot(x, interval_l, 'c')
# pretty-up the plot
fig.suptitle("OLS Linear Regression with confidence intervals")
ax.set_ylabel('Predicted Price [SIN $]')
ax.set_xlabel('Weight [Carat]')
ax.grid(True)
ax.legend(loc='best')
Summary of statistic values
The statsmodels package offers an overview of the model values, similar to what we calculated above:
In [64]: simpleModel.summary()
Out[64]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  price   R-squared:                       0.978
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     2070.
Date:                Sat, 25 Aug 2014   Prob (F-statistic):           6.75e-40
Time:                        13:50:30   Log-Likelihood:                -233.20
No. Observations:                  48   AIC:                             470.4
Df Residuals:                      46   BIC:                             474.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -259.6259     17.319    -14.991      0.000      -294.487  -224.765
carats      3721.0249     81.786     45.497      0.000      3556.398  3885.651
==============================================================================
Omnibus:                        0.739   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.691   Jarque-Bera (JB):                0.181
Skew:                          -0.056   Prob(JB):                        0.913
Kurtosis:                       3.280   Cond. No.                         18.5
==============================================================================
Many values can also be accessed directly, for example the standard errors:
In [65]: simpleModel.bse
Out[65]:
const     17.318856
carats    81.785880
You can see all the values available using dir():
dir(simpleModel)
As an exercise, you can use the more complete diamonds dataset available on the same JSE website, which has additional features, and see whether adding variables such as the diamond colour helps to get a more precise prediction.