A classification technique that models the probability that an observation belongs to a class of interest. $$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
Goal : Maximize likelihood
Optimization Method : Gradient Descent
Generally fits an ‘S’-shaped curve (the logistic function) that separates the data into the different classes.
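A minimal scikit-learn sketch of fitting this model on made-up data (note that scikit-learn's default solver is L-BFGS with L2 regularization rather than plain gradient descent):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up binary dataset with 4 features
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)   # maximizes the (penalized) likelihood
print(model.intercept_, model.coef_)     # beta_0 and beta_1..beta_n
print(model.predict_proba(X[:3]))        # class probabilities per observation
```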
Binary Classification#
Sigmoid transforms the linear combination of inputs into a probability. $$\sigma(z) = \frac{1}{1+e^{-z}}$$
- \(z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n\)
The curve (set of coefficients) that gives the maximum likelihood of the observed data is selected as the final model.
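A minimal NumPy sketch of the sigmoid mapping; the coefficients and observation below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map the linear combination z = beta_0 + beta_1*x_1 + ... into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients [beta_0, beta_1, beta_2] and one observation
beta = np.array([-1.0, 0.8, 2.5])
x = np.array([1.0, 0.5, 1.2])          # leading 1.0 multiplies the intercept
p = sigmoid(beta @ x)                  # predicted probability of class 1
print(p)
```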

Multinomial Logistic Regression#
One-vs-Rest (OvR) Classification :
- For \(K\) classes, train \(K\) separate binary classifiers (one class vs the rest).
- For a new observation, output \(K\) probabilities and select the class with the highest probability.
- Different binary classifiers may produce overlapping decision regions, which can lead to ambiguous classification, especially when classes are imbalanced.
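A hedged sketch of OvR using scikit-learn's OneVsRestClassifier on the iris dataset (three classes, so three binary models):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)      # 3 classes -> 3 binary classifiers

# Each binary model is "this class vs. the rest"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

proba = ovr.predict_proba(X[:1])       # K probabilities for a new observation
print(proba, proba.argmax(axis=1))     # select the class with the highest one
```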
Softmax Logistic Function :
Softmax ensures the predicted probabilities of all classes sum to 1. $$P(y=k|x)=\frac{e^{x^T\beta_{k}}}{\sum_{j=1}^Ke^{x^T\beta_{j}}}$$
- \(x\) = feature vector
- \(K\) = number of classes
- \(\beta_{k}\) = parameter vector for class \(k\)
- Provides a coherent probability model.
- Model coefficients can be interpreted as the change in log-odds of being in a given class.
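A small NumPy sketch of the softmax mapping for made-up scores \(x^\top\beta_k\) with \(K = 3\):

```python
import numpy as np

def softmax(scores):
    """Convert K linear scores x^T beta_k into probabilities that sum to 1."""
    shifted = scores - scores.max()    # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 0.5, -1.0])    # made-up scores for K = 3 classes
probs = softmax(scores)
print(probs, probs.sum())              # the probabilities sum to 1
```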
Maximum Likelihood Estimation#
Binary Classification : $$L(\beta) = \prod_{i=1}^{N} p_i^{y_i} (1-p_i)^{1-y_i}$$
- \(p_i\) = predicted probability for observation \(i\)
- \(y_i\) = actual label (0 or 1)
Log Transformation (Log-Likelihood) :
- The raw likelihood is a product of values between 0 and 1, so it underflows for large datasets.
- We apply log transformation to change it into a summation. $$\ell(\beta)=\sum_{i=1}^{N}(y_i \log(p_i) + (1-y_i) \log(1-p_i))$$
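A minimal NumPy sketch of the binary log-likelihood for made-up labels and predicted probabilities:

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    """Sum of y_i*log(p_i) + (1 - y_i)*log(1 - p_i)."""
    p = np.clip(p, eps, 1 - eps)       # guard against log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])             # made-up labels
p = np.array([0.9, 0.2, 0.7, 0.6])     # made-up predicted probabilities
print(log_likelihood(y, p))            # the quantity maximized during fitting
```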
Multi-Classification using Softmax : $$\ell({\beta_{k}}) = \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log\left( \frac{e^{x_i^\top \beta_{k}}}{\sum_{j=1}^{K} e^{x_i^\top \beta_j}} \right)$$
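And a corresponding sketch of the multi-class (softmax) log-likelihood, assuming one-hot labels and made-up linear scores \(x_i^\top \beta_k\):

```python
import numpy as np

def multiclass_log_likelihood(Y, scores):
    """Y: one-hot labels (N x K); scores: x_i^T beta_k (N x K)."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return np.sum(Y * log_probs)

Y = np.array([[1, 0, 0], [0, 0, 1]])                       # made-up one-hot labels
scores = np.array([[2.0, 0.1, -1.0],
                   [0.3, 0.2, 1.5]])                       # made-up linear scores
print(multiclass_log_likelihood(Y, scores))
```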
Assumptions#
- An ‘S’-shaped curve is assumed to fit the data; the independent variables should be strongly related to the likelihood of the class.
- Linearity in the Logit : The log-odds (logits) of the outcome are assumed to be a linear combination of the independent variables.
$$\log\left( \frac{p}{1-p} \right)=\beta_{0}+\beta_{1}x_{1}+\dots+\beta_{n}x_{n}$$

- The coefficients are on the log-odds scale, so standard Linear Regression tests (e.g., coefficient significance tests) can be applied to this linear predictor.
- Independence : Observations are assumed to be independent of each other.
- No or Little Multicollinearity : The independent variables should not be highly correlated with each other (a VIF check is sketched after this list).
- A sufficiently large sample is preferred to obtain reliable estimates.
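A hedged sketch of checking the multicollinearity assumption with variance inflation factors (VIF) using statsmodels, on made-up predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors; x3 is nearly a copy of x1 to force multicollinearity
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF well above ~5-10 flags a problematic predictor
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])
```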
Limitations#
- Linear Decision Boundary in Logit Space : Assuming a linear relationship between the predictors and the log-odds of the outcome may fail to capture more complex, non-linear relationships.
- Sensitivity to Outliers : Extreme values can disproportionately affect the model.
- Multicollinearity : High correlation among independent variables can destabilize coefficient estimates.
- Feature transformations : For data that is not linearly separable (in the logit domain), performance may suffer unless features are transformed or interaction terms are added.
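A sketch of the last point above: on concentric-circle data (not linearly separable), adding quadratic features lets logistic regression recover the boundary. The dataset and polynomial degree are illustrative choices:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles are not linearly separable in the original feature space
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

plain = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LogisticRegression(max_iter=1000)).fit(X, y)

print(plain.score(X, y), poly.score(X, y))   # quadratic terms recover the boundary
```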
Diagnostic Decision Table#
| Observation | Implication | Action |
|---|---|---|
| Poor Overall Model Fit : High deviance, low pseudo R-squared, or significant lack-of-fit (e.g., failing the Hosmer-Lemeshow test). | The model may be mis-specified, missing important predictors, or not capturing the true relationship between variables. | Reassess model specification: consider adding relevant predictors, interaction terms, or non-linear transformations. |
| Non-significant Coefficients : Predictors with high p-values or wide confidence intervals. | These predictors may not contribute meaningfully to predicting the outcome, possibly diluting the model’s effectiveness. | Remove or re-specify insignificant variables, and ensure that only meaningful predictors remain in the model. |
| Low Predictive Performance : Low classification accuracy, poor ROC-AUC, or imbalanced confusion matrix (e.g., high false positives/negatives). | The model might be underfitting, failing to capture important data patterns, or struggling with class imbalance. | Perform feature engineering, consider re-balancing the dataset, explore alternative model specifications, or try regularization techniques. |
| High Multicollinearity : Indicators such as high Variance Inflation Factor (VIF) values or inflated standard errors. | Predictor variables are highly correlated, leading to unstable coefficient estimates and difficulty in interpreting their individual effects. | Remove, combine, or transform correlated predictors, or apply regularization (L1/L2) to mitigate multicollinearity issues. |
| Influential Observations : A few data points exhibit high Cook’s distance or leverage values. | These observations may disproportionately affect model estimates, potentially skewing results and interpretations. | Investigate and validate these outliers. Consider robust regression techniques or, if justified, remove problematic observations after careful analysis. |
| Patterned Residuals : Residual plots show systematic patterns or non-random distribution. | This suggests that the model is missing key relationships (e.g., interactions or non-linear effects), indicating mis-specification. | Consider adding interaction terms, polynomial terms, or applying variable transformations to better capture the underlying data structure. |
| Poor Calibration in Multi-Class Models (OvR approach) : Inconsistent or poorly calibrated predicted probabilities across classes. | The individual binary classifiers may not provide directly comparable probabilities, potentially leading to ambiguous class assignments. | For multi-class problems, consider switching to multinomial (softmax) logistic regression, applying probability calibration techniques, or using methods to better balance class predictions. |
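A hedged sketch of computing a few of the diagnostics above (pseudo R-squared, Wald p-values, ROC-AUC) with statsmodels and scikit-learn on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# Simulated data: 500 observations, 3 predictors, one of them irrelevant
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
logits = 0.5 + X @ np.array([1.0, -2.0, 0.0])
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(result.prsquared)                                      # McFadden pseudo R-squared
print(result.pvalues)                                        # Wald p-values per coefficient
print(roc_auc_score(y, result.predict(sm.add_constant(X))))  # discrimination
```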
Steps#
Goal : Maximize likelihood
Model : \(\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n\)
- Compute the likelihood for each observation.
- Apply the log transformation to obtain the log-likelihood.
- Optimize the parameters using Gradient Descent.
- Calculate evaluation metrics and diagnose.
- Use the Wald test to determine whether a feature contributes significantly to the prediction.
- Validate model assumptions such as linearity and independence.
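A compact NumPy sketch of the steps above: compute probabilities, form the log-likelihood gradient, and update the coefficients by gradient ascent (the data and learning rate are made up; Wald tests and other diagnostics would follow, e.g. via statsmodels as in the earlier sketch):

```python
import numpy as np

# Made-up training data: 200 observations, intercept + 2 features
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_beta))).astype(int)

beta = np.zeros(X.shape[1])
lr = 0.1
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))    # predicted probability per observation
    gradient = X.T @ (y - p)           # gradient of the log-likelihood
    beta += lr * gradient / len(y)     # ascent on the log-likelihood
print(beta)                            # should land near true_beta
```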
