“Why” and “What” is Logistic Regression

Vignesh Kathirkamar
Published in Analytics Vidhya
4 min read · Aug 27, 2020

Introduction

Logistic regression is a statistical model used mainly for classification. Let’s understand what logistic regression is and how it can be useful to you in the fields of data science and transportation.

Why Logistic Regression?

Logistic regression is primarily used as a method for classification. Most of us are aware of linear regression, which is used to forecast values based on previous data. For example: forecasting traffic data, finding the growth rate of different modes of transportation within a city, etc.

But think of a situation where you need to find out whether a user will select a bus or a train, a car or a two-wheeler, whether a transport user is male or female, etc. In those cases logistic regression comes in handy. We can use logistic regression for classification. (Do remember that the statement made in many articles and books, “though named regression, logistic regression is a classification algorithm”, is a misconception. Find more details here.)

The Sigmoid Function

[Figure: Sigmoid function. Source: Transport Pythonified, Vignesh Kathirkamar]

The sigmoid function gives an output between 0 and 1 irrespective of the input given. Remember that it is hard to design a linear regression whose output values lie only between 0 and 1. The equation of the sigmoid function is σ(z) = 1/(1+e^(-z)), as shown in the figure above.
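The equation above can be written as a short Python function. This is a minimal sketch using only the standard library; the name `sigmoid` and the two-branch form (used to avoid overflow for large negative inputs) are choices made here for illustration, not something taken from the original article.

```python
import math


def sigmoid(z: float) -> float:
    """Return sigma(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z)
    # so that math.exp is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


if __name__ == "__main__":
    # Print the sigmoid value for a few sample inputs.
    for z in (-6, -2, 0, 2, 6):
        print(f"sigmoid({z}) = {sigmoid(z):.4f}")
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and an input of 0 gives exactly 0.5.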

Now think of fitting our linear model onto this logistic curve, as shown in the above figure. Every value on the linear regression line is converted into the corresponding value on the logistic curve, which ranges between 0 and 1.
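As a rough sketch of this idea, the snippet below passes the output of a straight-line model through the sigmoid. The feature (trip distance in km) and the coefficients `INTERCEPT` and `SLOPE` are hypothetical values invented for illustration; they are not fitted to any data and do not come from the article.

```python
import math

# Hypothetical coefficients, for illustration only (not fitted to real data).
INTERCEPT = -5.0
SLOPE = 0.5  # change in the linear score per km of trip distance


def linear_score(trip_km: float) -> float:
    """Plain linear model: the score can be any real number."""
    return INTERCEPT + SLOPE * trip_km


def probability(trip_km: float) -> float:
    """Map the linear score onto the logistic curve, giving a value in (0, 1)."""
    z = linear_score(trip_km)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for km in (2, 10, 20):
        print(f"{km:>2} km: score = {linear_score(km):+.1f}, "
              f"probability = {probability(km):.3f}")
```

The linear score is unbounded, but after the sigmoid every value lies between 0 and 1 and can be read as a probability.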

Thus we get a probability value as our output. So, let’s set a threshold of 0.5 to classify between the choices. If our output is below 0.5, we classify the choice as 0; otherwise the choice is 1.

Let’s say we have dummy coded the user selecting a bus as 0 and a car as 1. Then any output value below 0.5 gives “user selects bus”, and any output value of 0.5 or above gives “user selects car”.
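The thresholding rule can be sketched in a few lines of Python. The function name `classify` and the `MODE_LABELS` dictionary are illustrative names introduced here; the 0.5 threshold and the bus = 0, car = 1 coding follow the text.

```python
# Dummy coding from the text: bus is 0, car is 1.
MODE_LABELS = {0: "user selects bus", 1: "user selects car"}


def classify(probability: float, threshold: float = 0.5) -> int:
    """Return 0 if the probability is below the threshold, else 1."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return 0 if probability < threshold else 1


if __name__ == "__main__":
    for p in (0.12, 0.49, 0.50, 0.87):
        print(f"p = {p:.2f} -> {MODE_LABELS[classify(p)]}")
```

The threshold is kept as a parameter because 0.5 is a convention rather than a requirement; a different cut-off may suit problems where one kind of error is more costly than the other.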

Model Evaluation:

When you tell your friends that you have created a model that predicts a choice, they will be curious about how accurately your model predicts the user’s choice, won’t they? Every model has its own evaluation metrics. For example, linear regression has metrics like R², MSE, RMSE and MAE. In a similar way, for classification we have something called a confusion matrix.

Let me explain the concept of a confusion matrix with the example of a pregnancy test. Let’s say 165 people (both men and women) took a pregnancy test that uses the logistic regression model you have created. The following table shows the results along with the basic terminology, which is self-explanatory.

From the above figure we can see that 165 people participated, of whom 105 were actually pregnant and 60 were not. Our model, however, predicted 110 people as pregnant and 55 as not pregnant.

False positives are also known as Type 1 errors, and false negatives are known as Type 2 errors. Having seen the figure, you now have a better understanding of what the model was predicting.
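To make the four outcomes concrete, here is a minimal sketch that counts true positives, true negatives, false positives and false negatives from two lists of labels. The function name `confusion_counts` and the tiny `actual` and `predicted` lists are made up for illustration; they are not the pregnancy-test data.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels coded as 0 and 1."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # predicted positive, actually positive
        elif a == 0 and p == 0:
            counts["TN"] += 1  # predicted negative, actually negative
        elif a == 0 and p == 1:
            counts["FP"] += 1  # Type 1 error
        elif a == 1 and p == 0:
            counts["FN"] += 1  # Type 2 error
        else:
            raise ValueError("labels must be 0 or 1")
    return counts


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    actual = [1, 1, 1, 0, 0, 1, 0, 0]
    predicted = [1, 0, 1, 0, 1, 1, 0, 0]
    print(confusion_counts(actual, predicted))
```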

Now that we have understood the basic terms True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN), let’s see how the accuracy and misclassification rate of a model are estimated.

Accuracy = (TP+TN)/Total samples

Misclassification Rate = (FP+FN)/Total samples
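The two formulas can be sketched as small Python functions. The cell counts used below (TP = 100, TN = 50, FP = 10, FN = 5) are an assumption: they are hypothetical values chosen only because they are consistent with the totals quoted in the text (165 people, 105 actually pregnant, 60 not pregnant, 110 predicted pregnant, 55 predicted not pregnant). The original table is not reproduced here, so the real cell values may differ.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / total samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


def misclassification_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Misclassification rate = (FP + FN) / total samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (fp + fn) / total


# Hypothetical cell counts (an assumption, see the note above).
TP, TN, FP, FN = 100, 50, 10, 5

if __name__ == "__main__":
    print(f"Total samples:          {TP + TN + FP + FN}")
    print(f"Accuracy:               {accuracy(TP, TN, FP, FN):.3f}")
    print(f"Misclassification rate: {misclassification_rate(TP, TN, FP, FN):.3f}")
```

Because every sample is either classified correctly or not, the two values always add up to 1.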

Conclusion:

Now that we have understood what logistic regression is and how it can be used to predict a choice, we can apply it in the transportation field, for example to predict mode choice or the probability of an accident based on various parameters. A basic understanding of logistic regression also helps in machine learning projects.

If this was useful for you don’t forget to give two claps :)

Regards,

Vignesh Kathirkamar

Transport Pythonified
