Inference statistics for linear regression

We have seen how we can fit a model to existing data using linear regression. Now we want to assess how well the model describes those data points (every Outcome = Model + Error) and will use some statistics for it.

The following code is also available as a Notebook in GitHub.

As an example we access some available diamond ring data (from the Journal of Statistics Education): prices in Singapore dollars and weights in carats (the standard measure of diamond mass, equal to 0.2 g). 

import pandas as pd

diamondData = pd.read_csv("diamond.dat.txt", delim_whitespace=True,
                          header=None, names=["carats","price"])
In [1]: diamondData.head()
Out[1]:

carats

price

0

0.17

355

1

0.16

328

2

0.17

350

3

0.18

325

4

0.25

642

Fit a model

Is there a relationship between the diamond price and its weight?

Our first goal should be to determine whether the data provide evidence of an association between price and carats. If the evidence is weak, then one might argue that bigger diamonds are not better!

To evaluate the model we will use a special Python package, statsmodel, which is a package based on the original statistics module of SciPy (Scientific Python) by Jonathan Taylor – later removed – corrected, improved, tested and released as a new package during the Google Summer of Code 2009.

statsmodels_hybi_banner

Since statsmodels offers also functions to fit a linear regression model, we do not need to import and use sklearn to fit the model but we can do everything with statsmodels.

We will use its function OLS() that fits a linear regression based on the Ordinary Least Squares algorithm.

The model we want to get is :

y_hat = beta0 + beta1 * X

where y_hat is the estimated Diamond Price (the dependent variable) and x is the diamond Weight (the independent variable).

An intercept is not included by default and should be added by the user:

import statsmodels.api as sm

    # this append a column of ones
X = sm.add_constant(diamondData.carats) 

simpleModel = sm.OLS(diamondData.price, X).fit() # fit the model
In [2]:simpleModel.params  # get the coefficients
Out[2]: const     -259.625907
        carats    3721.024852

The intercept (Beta0) is -259.6 and the slope for carats (Beta1)  is 3721.
Therefore the resulting model is:

price = -259.6 + 3721 * carats

We can plot the obtained regression line together with the input data X.
We can do it by drawing a line using the beta parameters just calculated or also plotting the fitted values:

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

ax.scatter(diamondData.carats, diamondData.price)
  # draw linear regression line
y_hat = simpleModel.fittedvalues
ax.plot(diamondData.carats, y_hat)
  # pretty-up the plot
fig.suptitle("Relation between diamonds' price and weight")
ax.set_ylabel('Price [SIN $]')
ax.set_xlabel('Weight [carat]')
ax.grid(True)

lr

This answers our first question.
There is a relationship between price and weight of diamond and we can model it.

Which kind of relation is that?

Next question could be to find out if the relation is linear and how it looks like.

Coefficients interpretation

Beta1 is the expected change in response for a 1 unit change in the predictor.
In our case, we expect 3721 Singapore dollars increase in price for every carat increase in mass of diamond (within the restricted range considered).
This is within the restricted range considered; extrapolation of the regression line for bigger diamond stones would not be advisable as these stones are rarer and command a different price range.

Beta0 is the expected price when the weight is zero.
This does not always make sense and in our case the negative intercept is even more puzzling because it suggests that a zero-carat diamond ring has a negative economic value!

Getting a more interpretable intercept

The intercept -259.63 is the expected price of a 0 carat diamond.
Which does not make much sense (unless you consider it the cost of bothering the diamond expert when you ask the price of a non-existing diamond 🙂 )

It can be ignored (the model applies only to a restricted range of the data) or can be an indication that a different model could be more appropriate, for example a non-linear regression.

We can also try to find the expected price for a more suitable weight, for example the average diamond weight:

In [3]: diamondData.carats.mean()
Out[3]: 0.2041666666666667

In [4]: XmeanCentered = diamondData.carats - diamondData.carats.mean()
In [5]: XmeanCentered = sm.add_constant(XmeanCentered) 

In [6]: meanCenteredModel = sm.OLS(diamondData.price, XmeanCentered).fit()
In [7]: meanCenteredModel.params

Out[7]: const      500.083333
        carats    3721.024852

Thus $500 is the expected price for the average sized diamond of the sample (0.2042 carats for our dataset). Makes more sense.

As you can see, the slope is the same as the previous model, only the intercept shifted.
This is always valid when you shift your X values.

You can shift the X input by a certain value but you can also re-scale them.

This can be useful when one unit is quite large and we would prefer a finer scale.
For example, in our case, one carat is worth of almost 4000 SIN$ and could make sense to talk about tenth of carats (= 1/10).

In [8]: Xtenth = diamondData.carats * 10 # rescale the X
In [9]: Xtenth = sm.add_constant(Xtenth)

In [10]: tenthModel = sm.OLS(diamondData.price, Xtenth).fit()
In [11]: tenthModel.params

Out[11]: const    -259.625907
         carats    372.102485

 We expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in mass of diamond.
Note that this is equivalent to divide the slope coefficient by 10.

The intercept is the same as in the original model, only the slope coefficient is divided by 10.
This is always valid when you re-scale the X values.

Predicting the price of a diamond

Once we have a model of the relation, we can use it for predictions.
The statsmodel package has a method predict() associated to each model, that takes a new set of input and will output the predicted values, according to the model.

Let’s say that I want to buy a 0.2 carats diamond. How much should I expect it to cost?

I can use the beta parameters estimated by the model and just putting them into the linear regression formula:

In [12]: simpleModel.params[0] + 0.2*simpleModel.params[1]
Out[12]: 484.57906311853975

I expect to pay around 485 SIN $.

Or I can use the predict() function available in the statsmodel package:

In [13]: newDiamond = [1, 0.2] # remember to add always the intercept!
In [14]: simpleModel.predict(newDiamond)
Out[14]: array([ 484.57906312])

It’s also possible to pass a list of values to predict:

In [15]: newDiamonds = sm.add_constant([0.16, 0.27, 0.34])
In [16]: simpleModel.predict(newDiamonds)
Out[16]: array([  335.73806906,   745.05080273,  1005.52254234])

Result: for 0.16, 0.27, and 0.34 carats, we predict the prices to be respectively 335.74, 745.05, 1005.52 (SIN) dollars

How strong is the relationship?

We know that there is a relationship between diamonds carats and prices, we would like to know the strength of this relationship. In other words, given a certain diamond weight, can we predict the price with a high level of accuracy? This would be a strong relationship. Or is a prediction of prices based on weight  only slightly better than a random guess? This would be a weak relationship.

Residuals

As we have seen, the residuals are the difference between the observed (y) and the predicted outcome (y_hat):

In [17]: y_hat = simpleModel.fittedvalues
In [18]: y = diamondData.price

In [19]: max (abs (y - y_hat))
Out[19]: 85.158566087530232

Conveniently, the residuals are also stored in the results attribute resid:

In [20]: residuals = simpleModel.resid
In [21]: max(abs(residuals))
Out[21]: 85.158566087530232

85 SIN$ (per defect or excess) is the biggest error done by the model.

Don’t confuse errors and residuals.
The error is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean), and the residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).

We can learn many things from the residuals.
One is that their distribution and properties can give an indication about the model fit.

Residuals should not show any pattern

The residuals and their plot can highlight a poor model fit.

Here we plot the residuals versus the fitted values:

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(simpleModel.fittedvalues, residuals, 'o') # as round marks

  # pretty-up the plot
ax.plot ((0, 1200), (0,0)) # draw also a line at y=0
fig.suptitle("Residuals versus fitted prices")
ax.set_ylabel('Residuals [SIN $]')
ax.set_xlabel('Price [SIN $]')
ax.grid(True)

resid

The residuals should be distributed uniformly without showing any pattern and having a constant variance.

If we see from the residuals vs. fitted plot that the variance of the residuals increases as the fitted values increase (takes a form of a horizontal cone) this is the sign of heteroscedasticity.

Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.

Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable.

Examining the scatterplot of the residuals against the predicted values of the dependent variable would show the classic cone-shaped pattern of heteroscedasticity.

hs

Other patterns could be:

  • curvilinear (indicate is non-linear / missing higher-order term)
  • a single point is far away from zero (probably an outlier)
  • a single point is far away from the others in the x-direction (probably an influential point)

Residuals should be normally distributed

The sum of the residuals is expected to be zero (when there is an intercept).
This follows directly from the normal equation, i.e. the equation that the OLS estimator solves.

In [22]: sum(residuals)
Out[22]: -7.3896444519050419e-13

The mean of the residuals is expected to be zero.
This comes directly from the fact that OLS minimises the sum of square residuals

In [23]: import numpy as np
In [24]: np.mean(residuals)
Out[24]: -1.5395092608135503e-14

This is one of the assumptions for regression analysis: residuals should have a normal (or Gaussian) distribution.

In [25]: plt.hist(residuals)
Out[25]:

histr

It looks normal but we can verify better with a Q-Q plot.

Q-Q plot to verify the residuals distribution

Q-Q plots (stands for a “quantile-quantile plot”) can be used to check whether the data is distributed normally or not.

It is a plot where the axes are purposely transformed in order to make a normal (or Gaussian) distribution appear in a straight line. In other words, a perfectly normal distribution would exactly follow a line with slope = 1 and intercept = 0.

Therefore, if the plot does not appear to be – roughly – a straight line, then the underlying distribution is not normal. If it bends up, then there are more “high flyer” values than expected, for instance.

The theoretical quantiles are placed along the x-axis. That is, the x-axis is not your data, it is simply an expectation of where your data should have been, if it were normal.

The actual data is plotted along the y-axis.

The values are the standard deviations from the mean. So, 0 is the mean of the data, 1 is 1 standard deviation above, etc. This means, for instance, that 68.27% of all your data should be between -1 & 1, if you have a normal distribution.

statsmodels offers a handy qqplot() function:

In [26]: sm.qqplot(residuals, fit=True, line = '45')
Out[26]:

qq

Estimating residual variation

The residual variation measures how well the regression line fits the data points.

It is the variation in the dependent variable (Price) that is not explained by the regression model and is represented by the residuals. We want the residual variation to be as small as possible.

Each residual is distributed normally with mean 0 and variance = sigma_squared.

We have previously seen that the ML Estimate of variance, sigma_squared, is sum(residuals squared) divided by n and we called it the Mean Squared Error (MSE).

Most people use (n-2) instead of n so that the estimator is unbiased (the -2 is accounting for the degrees of freedom for intercept and slope).

The square root of the estimate, sigma, is called the Root Mean Squared Error (RMSE). We want both MSE and RMSE to be as small as possible.

In our diamonds example the estimated residual variation (unbiased RMSE) is :

In [27]: n = len(y)
In [28]: MSE = sum(residuals**2) / (n-2)
In [29]: RMSE = np.sqrt(MSE)
In [30]: RMSE
Out[30]: 31.840522265031755

RMSE can be used to calculate the standardised residuals too.
They equal the value of a residual divided by an estimate of its standard deviation (so, RMSE).
Large standardised residuals are an indication of an outlier.

In [31]: max(simpleModel.resid / RMSE)
Out[31]: 2.4927258932276684

Summarizing the variation: R-squared

The total variation is the residual variation (variation after removing predictors) plus the systematic variation (variation explained by regression model).

R-squared is the percentage of variability explained by the regression model:

R-squared = explained / total variation = 1 – residual / total variation

R-squared is always between 0 and 1 (0% and 100%):

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.

In general, the higher the R-squared, the better the model fits your data.

In [32]: simpleModel.rsquared
Out[32]: 0.97826077798603295

We are quite close to a perfect model.

You can use a fitted line plot to graphically illustrate different R-squared values.
The more variation that is explained by the model, the closer the data points fall to the line. Theoretically, if a model could explain 100% of the variation, the fitted values would always equal the observed values and all of the data points would fall on the line.

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(simpleModel.fittedvalues, diamondData.price, 'o') # as round marks
  # identity line
plt.plot(simpleModel.fittedvalues, simpleModel.fittedvalues, '--') 
  # pretty-up the plot
fig.suptitle("Relation between estimated and actual diamonds' prices")
ax.set_ylabel('Estimated Price [SIN $]')
ax.set_xlabel('Actual Price [SIN $]')
ax.grid(True)

fitted

R-squared can be a misleading summary and needs to be carefully taken (deleting data can inflate R-squared for example).

In conclusion (residuals distribution, variation) the model is pretty good and the relation is very strong.

Because of this, sometimes is preferred to use the adjusted Rsquared, which is Rsquared adjusted for the number of observations.
There are several formula that can be used, normally it is the Wherry’s formula:

In [33]: 1 - (1-simpleModel.rsquared)*((n-1)/simpleModel.df_resid)
Out[33]: 0.97778818620312058

Of course, it is also available from the model results:

In [34]: simpleModel.rsquared_adj
Out[34]: 0.97778818620312058

Confidence

How accurately can we predict the diamond prices?

For any given weight in carats, what is our prediction for the price, and what is the accuracy of this prediction?

In statistics, a sequence of random variables is independent and identically distributed (IID) if each random variable has the same probability distribution as the others and all are mutually independent.

Inference for regression

In the case of regression with IID sampling assumptions and normal distributed residuals, the statistics for our estimated beta coefficients:

  • will follow a finite sample T-distributions and be normally distributed
  • can be used to test null hypotesis
  • can be used to create a confidence interval

In probability and statistics, the t-distribution is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

Whereas a normal distribution describes a full population, t-distributions describe samples drawn from a full population.

The t-distribution becomes closer to the normal (Gaussian) distribution as its degrees of freedom (df) increases.

tdistr

The t-distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown parameter, such as a mean value, in a setting where the data are observed with additive errors.

If the population standard deviation of these errors is unknown and has to be estimated from the data, the t-distribution is often used to account for the extra uncertainty that results from this estimation.

Confidence intervals and hypothesis tests are two statistical procedures in which the quantiles of the sampling distribution of a particular statistic (e.g. the standard score) are required.

Confidence levels are expressed as a percentage (for example, a 95% confidence level). It means that should you take a sample over and over again, 95 percent of the time your results will match the results you get from the population.

When the population standard deviation sigma is not known, an interval estimate for the population with confidence level (1-alfa) is given by:

Xmean +- t * (estimated standard error of the mean)

where t is a critical value determined from the t-distribution in such a way that there is an are (1-alfa) between t and -t.

tdecrule

First, we need to calculate the variance.

Estimating the coefficients and the variance

Recall that our linear regression model is:

Y = Beta0 + Beta1 * X + errors

We can define the beta parameters as:

beta0 = mean(Y) – beta1 * mean(X)
beta1 = Cor(Y,X) * Sd(Y)/Sd(X)

For our diamonds example:

  # prepare the data
In [35]: y = diamondData.price
In [36]: x = diamondData.carats
In [37]: n = len(y)

  # calculate beta1
In [38]: beta1 = (np.corrcoef (y,x) * np.std(y) / np.std(x))[0][1]
In [39]: beta1
Out[39]: 3721.0248515504722

  # calculate beta0
In [40]: beta0 = np.mean(y) - beta1 * np.mean(x)
In [41]: beta0
Out[41]: -259.62590719155486

Sigma is unknown but its estimate is the squared root of the sum of the errors squared, divided by n-2 (the degrees of freedom)

In [42]: e = y - beta0 - beta1 * x # the residuals
  # unbiased estimate for variance
In [43]: sigma = np.sqrt(sum(e**2) / (n-2))
In [44]: sigma
Out[44]: 31.840522265031737

Confidence intervals

A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.

Recall of quantiles and percentiles of a distribution

If you were the 95th percentile on an exam, it means that 95% of people scored worse than you and 5% scored better.

These are sample quantiles.

Now for a population: the i-th quantile of a distribution with distribution function F is simply the point x_i so that :

F(x_i) = i

A percentile is simply a quantile with i expressed as a percent.
The 95th percentile of a distribution is the point where the probability that a random variable drawn from the population is less than 95%.

Approximately 68%, 95% and 99% of the normal density lies respectively within 1,2 and 3 standard deviations from the mean.

Estimating the Standard Errors

Now we need to calculate the standard errors.

In [45]: ssx = sum((x - np.mean(x))**2)

  # calculate standard error for beta0
In [46]: seBeta0 = (1 / n + np.mean(x) ** 2 / ssx) ** 0.5 * sigma
In [47]: seBeta0
Out[47]: 17.318856186448148
  # calculate standard error for beta1
In [48]: seBeta1 = sigma / np.sqrt(ssx)
In [49]: seBeta1
Out[49]: 81.785880366042491

The standard error of the parameter measures the precision of the estimate of the parameter.

The smaller the standard error, the more precise the estimate.

Hypothesis testing

Hypothesis testing is concerned with making decisions using data.

A null hypothesis is specified that represents the status quo, usually labeled H0.

The null hypothesis is assumed true and statistical evidence is required to reject it in favour of an alternative hypothesis.

one-sample-t-test

Consider testing H0: mu = mu0
If we take the set of all possible values for which you fail to reject H0, this set is an alfa% confidence interval for mu, alfa depending on the set.

Getting the T-values

Testing for null hypotesis H0: estimated beta0 and beta1 are equal to real coefficients

Dividing the parameter by its standard error calculates a t-value:

In [50]: tBeta0 = beta0 / seBeta0
In [51]: tBeta0
Out[51]: -14.990938454394572

In [52]: tBeta1 = beta1 / seBeta1
In [53]: tBeta1
Out[53]: 45.497154703190581

P-values

P-values are the most common measure of “statistical significance”.

The P-value is the probability under the null hypothesis of obtaining evidence as extreme or more extreme than would be observed by chance alone.
If the p-value is small, then either H0 is true and we have observed an extreme rare event or H0 is false.

Let’s say a P-value is 0.1: then the probability of seeing evidence as extreme or more extreme than what actually has been obtained under H0 is 0.1 (10%).

By reporting a P-value, any observer can perform the own hypothesis test at whatever alfa level they choose. If the P-value is less than alfa then you reject the null hypothesis.

pvalue

Estimating the P-values for hypotesis contrary beta0 is not equal to zero

We can use the T-distribution module from SciPy.stats to calculate the p-values

from scipy.stats import t

The survival function of a random variable is the probability that the random variable is bigger than a value x.
The SciPy.stats function sf() returns this probability:

In [54]: degreesOfFreedom = simpleModel.df_resid # The residual degree of freedom
In [55]: pBeta0 = t.sf(abs(tBeta0), df=degreesOfFreedom)*2 # two-sided
In [56]: pBeta1 = t.sf(abs(tBeta1), df=degreesOfFreedom)*2

Let’s summarise the values calculated until now:

print ("## Estimate Std. Error t-value p-value")
print ("Intercept: ", beta0, seBeta0, tBeta0, pBeta0)
print ("Carats: ", beta1, seBeta1, tBeta1, pBeta1)

##            Estimate     Std. Error    t-value        p-value
Intercept:  -259.625907192 17.3188561864 -14.9909384544 2.52327062853e-19
Carats:      3721.02485155 81.785880366 45.4971547032 6.75125983495e-40

If the p-value is less than the significance level (0.05 in our case) then the model explains the variation in the response.

T confidence intervals

For small samples we can use the t-distribution to calculate the confidence intervals.

The t-distribution has been invented by William Gosset in 1908 and is indexed by a degrees of freedom (df): it gets more like a standard normal as df gets larger.

ppf() is the Percent Point Function from SciPy.stats, that has as input the quantile (for a 2-sided 95% probability) and the degrees of freedom.

In [57]: alpha=0.05 # confidence interval for two-sided hypothesis
In [58]: qt = 1 - (alpha/2) # =0.975 for a 2-sided 95% probability
In [59]: t_value = t.ppf(qt, df=degreesOfFreedom) 
In [60]: t_value
Out[60]: 2.0128955952945886

Now we can calculate the intervals for beta0 and beta1:

In [61]: limits=[-1,1]
In [62]: [beta0 + i*t_value*seBeta0 for i in limits]
Out[62]: [-294.48695652479677, -224.76485785831295]

In [63]: [beta1 + i*t_value*seBeta1 for i in limits]
Out[63]: [3556.3984132043752, 3885.6512898965693]

Interpretation: With 95% confidence, we estimate that 1 carat increase in diamond size results in a 3556 to 3886 increase in price in (Singapore) dollars.

Plot the confidence interval

We calculate the interval for each x value; will use the isf() function to get the inverse survival function:

predicted = simpleModel.fittedvalues
x_1 = simpleModel.model.exog # just the x values plus column of 1
  # get standard deviation of predicted values
predvar = simpleModel.mse_resid + 
           (x_1 * np.dot(simpleModel.cov_params(), x_1.T).T).sum(1)
predstd = np.sqrt(predvar)

tppf = t.isf(alpha/2.0, simpleModel.df_resid)

interval_u = predicted + tppf * predstd
interval_l = predicted - tppf * predstd

fig, ax = plt.subplots()
ax.plot(x,y, 'o', label="data")
ax.plot(x, simpleModel.fittedvalues, 'g-', label="OLS")

ax.plot(x, interval_u, 'c--', label = "Intervals")
ax.plot(x, interval_l, 'c--')
  # pretty-up the plot
fig.suptitle("OLS Linear Regression with confidence intervals")
ax.set_ylabel('Predicted Price [SIN $]')
ax.set_xlabel('Weight [Carat]')
ax.grid(True)
ax.legend(loc='best')

conf

Summary of statistic values

The statsmodel package offers an overview of the model values, similar to what we calculated above:

In [64]: simpleModel.summary()

Out[64]: OLS Regression Results

Dep. Variable:

price

R-squared:

0.978

Model:

OLS

Adj. R-squared:

0.978

Method:

Least Squares

F-statistic:

2070.

Date:

Sat, 25 Aug 2014

Prob (F-statistic):

6.75e-40

Time:

13:50:30

Log-Likelihood:

-233.20

No. Observations:

48

AIC:

470.4

Df Residuals:

46

BIC:

474.1

Df Model:

1

Covariance Type:

nonrobust

coef

std err

t

P>|t|

[95.0% Conf. Int.]

const

-259.6259

17.319

-14.991

0.000

-294.487 -224.765

carats

3721.0249

81.786

45.497

0.000

3556.398 3885.651

Omnibus:

0.739

Durbin-Watson:

1.994

Prob(Omnibus):

0.691

Jarque-Bera (JB):

0.181

Skew:

0.056

Prob(JB):

0.913

Kurtosis:

3.280

Cond. No.

18.5

Many values can also be accessed directly, for example the standard errors:

In [65]: simpleModel.bse
Out[65]: const     17.318856
         carats    81.785880

You can see all the values available using dir():

dir(simpleModel)

As an exercise, you can use the more complete diamonds data available on the same JSE website which has additional features and see if adding variables such as the diamond colour help to get a more precise prediction.

 

Advertisements

One thought on “Inference statistics for linear regression

  1. Pingback: Features selection for linear regression – Look back in respect

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s