Logistic regression with Python statsmodels

We have seen an introduction to logistic regression, with a simple example: predicting a student's admission to university based on past exam results.
That was done in plain Python, using the sigmoid function and gradient descent.

We can now see how to solve the same example using the statsmodels library, specifically its Logit class for logistic regression. The library provides an optimised, efficient algorithm to find the regression parameters.
You can follow along from the Python notebook on GitHub.

The initial part is exactly the same: read the training data and prepare the target variable.
Then we import and use the statsmodels Logit class:

```
import statsmodels.api as sm

model = sm.Logit(y, X)
result = model.fit()
```
```
Optimization terminated successfully.
         Current function value: 0.203498
         Iterations 9
```
`result.summary()`
```
                           Logit Regression Results
==============================================================================
Dep. Variable:               Admitted   No. Observations:                  100
Model:                          Logit   Df Residuals:                       97
Method:                           MLE   Df Model:                            2
Date:                Tue, 18 Jul 2017   Pseudo R-squ.:                  0.6976
Time:                        15:06:33   Log-Likelihood:                -20.350
converged:                       True   LL-Null:                       -67.301
                                        LLR p-value:                 4.067e-21
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Exam1          0.2062      0.048      4.296      0.000         0.112     0.300
Exam2          0.2015      0.049      4.143      0.000         0.106     0.297
intercept    -25.1613      5.799     -4.339      0.000       -36.526   -13.796
==============================================================================
```

You get a great overview of the coefficients of the model, how well those coefficients fit the data, the overall quality of the fit, and several other statistical measures.
The result object also lets you isolate and inspect parts of the model output; for example, the coefficients are in the `params` field:

```
coefficients = result.params
coefficients
```
```
Exam1        0.206232
Exam2        0.201472
intercept  -25.161334
```

As you can see, the model found the same coefficients as in the previous example.
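To see those coefficients in action, here is a minimal sketch that plugs them into the sigmoid function from the previous post to estimate an admission probability. The exam scores below are hypothetical, not taken from the dataset:

```python
import numpy as np

# Coefficients reported by result.params above
coef_exam1, coef_exam2, intercept = 0.206232, 0.201472, -25.161334

def admission_probability(exam1, exam2):
    """Sigmoid of the linear combination of the two exam scores."""
    z = intercept + coef_exam1 * exam1 + coef_exam2 * exam2
    return 1.0 / (1.0 + np.exp(-z))

print(admission_probability(45, 85))  # mixed scores: borderline probability
print(admission_probability(80, 80))  # high scores: probability close to 1
```

This is equivalent to calling `result.predict()` on new data, but makes the role of each coefficient explicit.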

The confidence intervals give you an idea of how robust the coefficients of the model are.

`result.conf_int()`
```
                   0          1
Exam1       0.112152   0.300311
Exam2       0.106168   0.296775
intercept -36.526287 -13.796380
```

Note: this post is part of a series about Machine Learning with Python.

15 thoughts on “Logistic regression with Python statsmodels”

1. Solomon

This is great. But I have an issue with my result: the coefficients failed to converge after 35 iterations. How can I increase the number of iterations? Also, I'm working with complex survey design data; how do I include the sampling unit and sampling weight in the model?


2. devender kumar

```
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

X = data_final.loc[:, data_final.columns != target]
y = data_final.loc[:, target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = sm.Logit(endog=y_train, exog=X_train)
result = model.fit()
```

```
                         0         1
Delay_bin         0.992853  1.068759
LIMIT_BAL_bin     0.282436  0.447070
Avg_Use_bin       0.151494  0.353306
Tot_percpaid_bin  0.300069  0.490454
Edu              -0.278094  0.220439
Age_bin           0.169336  0.732283
```

3. Anonymous

What does MLE stand for? Is it Maximum Likelihood Estimation?

1. mashimo

Yes, you are correct.

4. What is the definition of “current function value”?

1. mashimo

In this case it is the final cost minimised after n iterations (the cost being, in short, the difference between the predictions and the actual labels).
I think that statsmodels internally uses the scipy.optimize.minimize() function to minimise the cost function, and that method is generic, therefore the verbose log just says “function value”.

5. SinSin

How is the y defined?

1. mashimo

Each student has a final admission result (1 = yes, 0 = no).
Basically, y is a binary variable with only two possible values.
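A minimal sketch of building such a target, assuming a hypothetical `Admitted` column with yes/no labels (not the article's actual dataset):

```python
import pandas as pd

# Hypothetical admission records
df = pd.DataFrame({
    "Exam1": [45, 80, 62],
    "Exam2": [85, 80, 50],
    "Admitted": ["yes", "yes", "no"],
})

# y is a binary variable: 1 = admitted, 0 = not admitted
y = (df["Admitted"] == "yes").astype(int)
print(y.tolist())  # [1, 1, 0]
```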