Inference statistics for linear regression

We have seen how we can fit a model to existing data using linear regression. Now we want to assess how well the model describes those data points (every Outcome = Model + Error), and we will use some inferential statistics to do so.
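The decomposition Outcome = Model + Error can be made concrete with a minimal sketch in plain Python. The `x` and `y` values below are toy numbers chosen only for illustration (they are not the diamond data), and the variable names are assumptions of this sketch:

```python
# Toy values for illustration only; NOT the diamond data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares estimates for a simple linear regression.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# "Model" part: the fitted values on the regression line.
fitted = [intercept + slope * xi for xi in x]
# "Error" part: the residuals, i.e. what the line does not explain.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Every outcome is recovered as model + error.
for yi, fi, ei in zip(y, fitted, residuals):
    assert abs(yi - (fi + ei)) < 1e-12

# The residual sum of squares is one simple summary of the error.
rss = sum(e ** 2 for e in residuals)
print("slope:", slope, "intercept:", intercept, "RSS:", rss)
```

The smaller the residuals are relative to the spread of the outcomes, the better the line describes the data; the statistics discussed in this post quantify that idea.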

The following code is also available as a Notebook in GitHub.

As an example we use a publicly available diamond ring data set (from the Journal of Statistics Education): prices in Singapore dollars and weights in carats (the standard measure of diamond mass, equal to 0.2 g).

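import pandas as pd

# The file is whitespace-separated and has no header row.
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas.
diamondData = pd.read_csv("diamond.dat.txt", sep=r"\s+",
                          header=None, names=["carats","price"])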
In [1]: diamondData.head()
Out[1]:
   carats  price
0    0.17    355
1    0.16    328
2    0.17    350
3    0.18    325
4    0.25    642

Fit a model

Is there a relationship between the diamond price and its weight?

Our first goal should be to determine whether the data provide evidence of an association between price and carats. If the evidence is weak, then one might argue that bigger diamonds are not better!

To evaluate the model we will use a dedicated Python package, statsmodels. It is based on the original statistics models module of SciPy (Scientific Python), written by Jonathan Taylor, which was later removed from SciPy. The code was then corrected, improved, tested and released as a new package during the Google Summer of Code 2009.

Since statsmodels also offers functions to fit a linear regression model, we do not need to import sklearn to fit the model; we can do everything with statsmodels.
