Logistic Regression Theory
Logistic regression is fundamentally a classification algorithm that models the probability of a binary outcome (0 or 1). It transforms a linear combination of the input features into a probability score using the logistic (sigmoid) function, allowing us to classify data points.
In this section, we will cover:
- The logistic (sigmoid) function.
- The log-odds and decision boundary.
- The cost function (log-likelihood).
- Optimization through gradient descent.
1. Logistic (Sigmoid) Function
At the core of logistic regression is the sigmoid function, which takes any real-valued input and maps it to a value between 0 and 1. This is crucial because we are modeling probabilities, and probabilities must always lie within this range.
The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:
- $z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$ is the linear combination of the input features with the coefficients $\beta$.

The output of the sigmoid function is a value between 0 and 1, which can be interpreted as the probability that the input belongs to the positive class ($y = 1$).
For example:
- If $\sigma(z) = 0.8$, there is an 80% chance that the input belongs to class 1.
- If $\sigma(z) = 0.2$, there is only a 20% chance that the input belongs to class 1.
The sigmoid function is crucial because it maps the unbounded linear score onto a valid probability scale. Note that although the sigmoid itself is nonlinear, the decision boundary of logistic regression at a fixed probability threshold is still linear in the input features.
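To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid applied to a few hand-picked scores (the values $\pm 1.386$ are chosen so they land near the 0.8 and 0.2 probabilities used in the example above):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A large positive score maps close to 1, a large negative score maps
# close to 0, and a score of 0 maps to exactly 0.5.
for z in [-4.0, -1.386, 0.0, 1.386, 4.0]:
    print(f"z = {z:+.3f}  ->  sigmoid(z) = {sigmoid(z):.3f}")
```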
2. Log-Odds and Decision Boundary
Log-Odds
Logistic regression predicts the log-odds of the binary outcome as a linear function of the input variables. The log-odds (also called the logit function) is defined as:

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$$

Where:
- $p$ is the probability that the dependent variable equals 1 ($P(y = 1)$).

The log-odds transform the probability into a continuous value that can range from $-\infty$ to $+\infty$. Logistic regression models the log-odds linearly as:

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
This allows us to use a linear model to predict the log-odds, which is then converted back into a probability using the sigmoid function.
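As a quick sanity check, the sketch below (plain NumPy, no fitted model) converts a probability to log-odds and back, showing that the logit and sigmoid functions are inverses of each other:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = 0.8
z = logit(p)            # log-odds of 0.8 is about 1.386
print(z, sigmoid(z))    # sigmoid recovers the original probability 0.8
```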
Decision Boundary
The decision boundary is where we apply a threshold to classify an observation. By default, the threshold is 0.5, meaning that:
- If $P(y = 1) \geq 0.5$, we classify the input as class 1.
- If $P(y = 1) < 0.5$, we classify the input as class 0.
Mathematically, this decision boundary is defined by:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n = 0$$

which is equivalent to $P(y = 1) = 0.5$.
At this boundary, the probability of the two classes is equal, and the input could be classified as either 0 or 1.
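Here is a minimal sketch of applying the default 0.5 threshold to predicted probabilities; the probabilities are made-up values rather than the output of a fitted model:

```python
import numpy as np

# Hypothetical predicted probabilities for five observations.
probabilities = np.array([0.92, 0.35, 0.50, 0.07, 0.61])

# Apply the default decision threshold of 0.5.
predictions = (probabilities >= 0.5).astype(int)
print(predictions)  # [1 0 1 0 1]
```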
3. Cost Function: Log-Likelihood
In logistic regression, we use maximum likelihood estimation (MLE) to find the values of the coefficients that maximize the likelihood of observing the given data. The objective function, the log-likelihood, measures how well the model's predicted probabilities align with the actual outcomes.
The log-likelihood for logistic regression is:

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]$$

Where:
- $y_i$ is the actual label for the $i$-th observation (either 0 or 1).
- $\hat{p}_i$ is the predicted probability that the $i$-th observation belongs to class 1, which is computed using the sigmoid function.

This function computes the log-likelihood for each observation, then sums them up. The goal of logistic regression is to maximize this log-likelihood, meaning we want to find the coefficient values ($\beta$) that make the observed data most likely.
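A short sketch of computing this log-likelihood by hand for a handful of made-up labels and predicted probabilities (all values here are purely illustrative):

```python
import numpy as np

# Hypothetical labels and predicted probabilities.
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Sum of per-observation log-likelihood contributions.
log_likelihood = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(log_likelihood)  # closer to 0 (less negative) means a better fit
```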
Binary Cross-Entropy Loss
In practice, we often minimize the negative log-likelihood, which is equivalent to the binary cross-entropy loss function:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]$$
This cost function penalizes confident incorrect predictions heavily (the loss grows without bound as the predicted probability of the true class approaches 0), ensuring that the model updates its coefficients in the right direction during training.
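Continuing the example above, the binary cross-entropy is just the averaged negative log-likelihood; scikit-learn's log_loss computes the same quantity, which makes for a convenient cross-check (values are still illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Binary cross-entropy: averaged negative log-likelihood.
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(bce)
print(log_loss(y, p_hat))  # should match the manual computation
```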
4. Optimization: Gradient Descent
Since there is no closed-form solution for the coefficients in logistic regression, we use an iterative optimization algorithm, such as gradient descent, to minimize the cost function.
Gradient Descent Algorithm
Gradient descent works by computing the gradient (partial derivatives) of the cost function with respect to each coefficient and updating the coefficients in the direction that reduces the cost.
The update rule for each coefficient is:

$$\beta_j := \beta_j - \alpha \frac{\partial J(\beta)}{\partial \beta_j}$$

Where:
- $\alpha$ is the learning rate, which controls the step size for each update.
- $\frac{\partial J(\beta)}{\partial \beta_j}$ is the gradient of the cost function with respect to $\beta_j$.
The gradient tells us how much the cost function changes with a small change in $\beta_j$. By following the negative gradient, we gradually improve the coefficients until we reach a point where the cost function is minimized.
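For the binary cross-entropy cost, this gradient takes a standard closed form (stated here without derivation, writing $x_{ij}$ for the value of feature $j$ in observation $i$, with the usual convention $x_{i0} = 1$ for the intercept):

$$\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{p}_i - y_i \right) x_{ij}$$

This is the form used in the gradient descent sketch at the end of this section.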
Convergence
The algorithm continues iterating until it converges, meaning the changes in the coefficients become very small or the cost function stops decreasing. At convergence, the coefficients are optimized, and the model is ready to make predictions.
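Putting the pieces together, here is a minimal NumPy sketch of batch gradient descent for logistic regression on synthetic data. The data, learning rate, and stopping tolerance are all illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data: 200 samples, 2 features.
X = rng.normal(size=(200, 2))
true_beta = np.array([1.5, -2.0])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

# Add an intercept column of ones.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta = np.zeros(X_b.shape[1])   # start from all-zero coefficients
alpha = 0.1                     # learning rate (illustrative)
tol = 1e-6                      # convergence tolerance on the update size

for step in range(10_000):
    p_hat = sigmoid(X_b @ beta)                 # predicted probabilities
    gradient = X_b.T @ (p_hat - y) / len(y)     # gradient of the BCE cost
    update = alpha * gradient
    beta -= update                              # move against the gradient
    if np.max(np.abs(update)) < tol:            # converged: updates are tiny
        break

print(f"Converged after {step + 1} steps, coefficients: {beta}")
```

The convergence check here stops when the largest coefficient update falls below the tolerance; monitoring the change in the cost function itself is an equally common stopping criterion.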
Summary
In this section, we explored the mathematical foundation of logistic regression, focusing on:
- The sigmoid function, which maps linear predictions to probabilities.
- Log-odds and the decision boundary for classification.
- The log-likelihood, which we maximize, and the equivalent binary cross-entropy loss, which we minimize.
- Optimization using gradient descent to find the best-fitting coefficients.
Understanding these concepts is crucial for applying logistic regression effectively and interpreting the resulting models. In the next section, we will dive into practical examples of implementing logistic regression using popular libraries like scikit-learn, TensorFlow, and PyTorch.