We have seen an introduction to logistic regression, with a simple example of predicting a student's admission to university based on past exam results.

That example was implemented from scratch in Python, using the sigmoid function and gradient descent.

We can now solve the same example using the *statsmodels* library, specifically its *Logit* model for logistic regression. The library contains an optimised and efficient algorithm to find the regression parameters.

You can follow along from the Python notebook on GitHub.

The initial part is exactly the same: read the training data, prepare the target variable.
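That preparation step might look like this minimal sketch (the inline file contents and column names here are hypothetical stand-ins for the notebook's actual data):

```python
from io import StringIO
import pandas as pd

# A miniature stand-in for the real training file (hypothetical values)
csv = StringIO("34.6,78.0,0\n60.2,86.3,1\n79.0,75.3,1\n")
data = pd.read_csv(csv, header=None, names=["Exam1", "Exam2", "Admitted"])

y = data["Admitted"]                  # target: 1 = admitted, 0 = not admitted
X = data[["Exam1", "Exam2"]].copy()   # features: the two exam scores
X["intercept"] = 1.0                  # Logit does not add an intercept automatically
```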

Then, we’re going to import and use the *statsmodels* *Logit* function:

```python
import statsmodels.api as sm  # the Logit class lives in the main API

model = sm.Logit(y, X)
result = model.fit()
```

```
Optimization terminated successfully.
         Current function value: 0.203498
         Iterations 9
```

```python
result.summary()
```

| Dep. Variable: | Admitted | No. Observations: | 100 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 97 |
| Method: | MLE | Df Model: | 2 |
| Date: | Tue, 18 Jul 2017 | Pseudo R-squ.: | 0.6976 |
| Time: | 15:06:33 | Log-Likelihood: | -20.350 |
| converged: | True | LL-Null: | -67.301 |
| | | LLR p-value: | 4.067e-21 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Exam1 | 0.2062 | 0.048 | 4.296 | 0.000 | 0.112 0.300 |
| Exam2 | 0.2015 | 0.049 | 4.143 | 0.000 | 0.106 0.297 |
| intercept | -25.1613 | 5.799 | -4.339 | 0.000 | -36.526 -13.796 |

You get a great overview of the coefficients of the model, how well those coefficients fit, the overall fit quality, and several other statistical measures.

The result object also lets you isolate and inspect parts of the model output; for example, the coefficients are in the `params` field:

```python
coefficients = result.params
coefficients
```

```
Exam1         0.206232
Exam2         0.201472
intercept   -25.161334
```

As you can see, the model found the same coefficients as in the previous example.
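With those coefficients you can score a new student by applying the sigmoid to the linear combination of the inputs; a small sketch, using the coefficient values reported above (in practice `result.predict()` does this for you):

```python
import numpy as np

# Coefficient values as reported by result.params above
coef = {"Exam1": 0.206232, "Exam2": 0.201472, "intercept": -25.161334}

def admission_probability(exam1, exam2):
    """Sigmoid of the linear combination of the exam scores."""
    z = coef["Exam1"] * exam1 + coef["Exam2"] * exam2 + coef["intercept"]
    return 1.0 / (1.0 + np.exp(-z))

print(admission_probability(80, 80))  # two strong scores: probability close to 1
print(admission_probability(40, 40))  # two weak scores: probability close to 0
```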

The confidence intervals give you an idea of how robust the coefficients of the model are.

```python
result.conf_int()
```

```
Exam1       0.112152    0.300311
Exam2       0.106168    0.296775
intercept -36.526287  -13.796380
```

Note: this post is part of a series about Machine Learning with Python.

This is great. But I have an issue with my results: the coefficients failed to converge after 35 iterations. How can I increase the number of iterations? Also, I'm working with complex survey design data; how do I include the sampling unit and sampling weight in the model?

Please ignore the errors

I am not getting an intercept in the model. Please help!

```python
import statsmodels.api as sm  # Logit(endog, exog) lives in the main API
from sklearn.model_selection import train_test_split

X = data_final.loc[:, data_final.columns != target]
y = data_final.loc[:, target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = sm.Logit(endog=y_train, exog=X_train)
result = model.fit()
```

```
                         0         1
Delay_bin         0.992853  1.068759
LIMIT_BAL_bin     0.282436  0.447070
Avg_Use_bin       0.151494  0.353306
Tot_percpaid_bin  0.300069  0.490454
Edu              -0.278094  0.220439
Age_bin           0.169336  0.732283
```


What does MLE stand for? Is it Maximum Likelihood Estimation?

Yes, you are correct.

What is the definition of “current function value”?

In this case it is the final cost minimised after n iterations (the cost being, in short, the difference between the predictions and the actual labels).

I think that *statsmodels* internally uses the `scipy.optimize.minimize()` function to minimise the cost function, and that method is generic, therefore the verbose logs just say “function value”.




How is the y defined?

Each student has a final admission result (1 = yes, 0 = no).

Basically y is a logical variable with only two values.
