Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a fundamental method in statistical inference used to estimate the parameters of a statistical model. It is widely used in data science and machine learning because of its theoretical properties and practical applicability. This article dives deep into the principles of MLE, explores its properties, and provides detailed examples of how to apply it to estimate parameters in various statistical models.

Understanding Maximum Likelihood Estimation (MLE)

What is MLE?

Maximum Likelihood Estimation is a method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood function. The likelihood function measures how plausible the observed data is given a set of parameter values.

Mathematically, if we have a statistical model with a parameter $\theta$ and a set of observed data $X = \{x_1, x_2, \dots, x_n\}$, the likelihood function is defined as:

$$L(\theta; X) = P(X \mid \theta)$$

The goal of MLE is to find the parameter value $\hat{\theta}$ that maximizes the likelihood function:

$$\hat{\theta} = \underset{\theta}{\text{argmax}} \, L(\theta; X)$$

Assumptions:

  • Independence and Identical Distribution (i.i.d.): The data points $x_1, x_2, \dots, x_n$ are assumed to be independent and identically distributed. This allows the likelihood function to factor into the product of individual probabilities.
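
Written out, the i.i.d. assumption lets the joint likelihood factor into a product of per-observation terms:

$$L(\theta; X) = P(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta)$$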

Likelihood vs. Probability

It’s important to distinguish between likelihood and probability:

  • Probability ($P(X \mid \theta)$): The probability of observing the data $X$ given specific parameter values $\theta$. It is used to predict future data based on the model.

  • Likelihood ($L(\theta \mid X)$): A function of the parameters $\theta$ given the observed data $X$. It represents how plausible the parameters are in explaining the observed data.

While probability is used to predict future data given a model, likelihood is used to infer model parameters from observed data.

Log-Likelihood

To simplify the maximization process, especially when dealing with products of probabilities, it’s common to work with the log-likelihood function. The log-likelihood is the natural logarithm of the likelihood function:

$$\ell(\theta; X) = \log L(\theta; X) = \sum_{i=1}^n \log P(x_i \mid \theta)$$

Because the logarithm is a strictly increasing function, maximizing the log-likelihood $\ell(\theta; X)$ gives the same result as maximizing the likelihood $L(\theta; X)$, but with the advantage of turning products into sums, making the calculations easier and numerically more stable.
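
To see the numerical-stability point concretely, here is a minimal Python sketch (the data and parameter values are made up for illustration) that evaluates a normal likelihood both as a raw product and as a sum of logs:

```python
import numpy as np

# Illustrative data: 1,000 draws from a standard normal distribution
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Raw likelihood: a product of 1,000 numbers well below 1 underflows to 0.0
likelihood = np.prod(normal_pdf(x, mu=0.0, sigma=1.0))

# Log-likelihood: the sum of logs stays comfortably within floating-point range
log_likelihood = np.sum(np.log(normal_pdf(x, mu=0.0, sigma=1.0)))

print(likelihood)       # 0.0 due to underflow
print(log_likelihood)   # a finite value around -1400
```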

Properties of Maximum Likelihood Estimators

MLE has several desirable properties that make it a powerful method in statistical inference:

  1. Consistency:

    • As the sample size increases, the MLE converges to the true parameter value. This means that with enough data, the MLE will give an accurate estimate of the parameter.
  2. Asymptotic Normality:

    • For large samples, the distribution of the MLE is approximately normal (Gaussian) with a mean equal to the true parameter value and a variance equal to the inverse of the Fisher information. This allows for the construction of confidence intervals and hypothesis tests; the simulation sketch after this list illustrates both consistency and this approximate normality.
  3. Efficiency:

    • Asymptotically, the MLE attains the Cramér-Rao lower bound (the smallest variance achievable by any unbiased estimator), making it the most efficient estimator under regularity conditions.
  4. Invariance:

    • If $\hat{\theta}$ is the MLE for $\theta$, and $\phi = g(\theta)$ is a function of $\theta$, then the MLE for $\phi$ is $g(\hat{\theta})$. This property makes MLEs easy to work with when transforming parameters.

    Invariance Example:

    Suppose $\hat{\theta}$ is the MLE for $\theta$, and you are interested in estimating $\phi = \theta^2$. According to the invariance property, the MLE for $\phi$ is $\hat{\phi} = \hat{\theta}^2$.

  5. Regularity Conditions:

    • These properties hold under certain regularity conditions, such as the existence of the first and second derivatives of the log-likelihood function and the parameter space being open. These conditions ensure that the mathematical derivations leading to the properties are valid.
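
To make consistency and asymptotic normality concrete, here is a small simulation sketch in Python, assuming a simple Bernoulli model with a made-up true value of $p = 0.3$:

```python
import numpy as np

# Repeatedly run an n-flip Bernoulli experiment and compute the MLE
# p_hat = sample proportion for each repetition.
rng = np.random.default_rng(42)
p_true = 0.3

for n in (10, 100, 10_000):
    samples = rng.binomial(n=1, p=p_true, size=(5_000, n))  # 5,000 repetitions
    p_hat = samples.mean(axis=1)                             # MLE per repetition
    print(f"n={n:>6}: mean of p_hat = {p_hat.mean():.4f}, std = {p_hat.std():.4f}")

# Consistency: as n grows, p_hat concentrates around p_true = 0.3.
# Asymptotic normality: the spread shrinks roughly like
# sqrt(p_true * (1 - p_true) / n), i.e. the inverse Fisher information
# scaled by the sample size, and a histogram of p_hat looks increasingly Gaussian.
```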

Applying MLE to Estimate Parameters

Let’s explore how to apply MLE to estimate parameters in various statistical models through detailed examples.

Example 1: Estimating the Mean of a Normal Distribution

Problem Setup

Suppose you have a set of data points that you believe are drawn from a normal distribution with an unknown mean $\mu$ and a known variance $\sigma^2 = 4$. Your goal is to estimate the mean $\mu$ using MLE.

Step 1: Write Down the Likelihood Function

The likelihood function for the normal distribution is given by:

$$L(\mu; X) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Given that $\sigma^2$ is known and constant, the factor $\frac{1}{\sqrt{2\pi\sigma^2}}$ does not affect where the maximum occurs, so we can focus on the part of the likelihood that depends on $\mu$:

$$L(\mu; X) \propto \prod_{i=1}^n \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Step 2: Simplify Using the Log-Likelihood

Taking the logarithm of the likelihood function to get the log-likelihood:

$$\ell(\mu; X) = \log L(\mu; X) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

The first term is constant with respect to $\mu$, so we can ignore it when maximizing. Focus on the second term:

$$\ell(\mu; X) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

Step 3: Maximize the Log-Likelihood

To find the value of $\mu$ that maximizes the log-likelihood, take the derivative with respect to $\mu$ and set it to zero:

$$\frac{d\ell(\mu; X)}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0$$

This simplifies to:

$$\sum_{i=1}^n (x_i - \mu) = 0$$

which further simplifies to:

$$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$

Thus, the MLE for $\mu$ is the sample mean:

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$
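
As a quick sanity check, the second derivative of the log-likelihood is negative, so this stationary point is indeed a maximum rather than a minimum:

$$\frac{d^2\ell(\mu; X)}{d\mu^2} = -\frac{n}{\sigma^2} < 0$$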

Step 4: Interpretation

The MLE for the mean $\mu$ of a normal distribution with known variance is simply the arithmetic mean of the observed data. This result is intuitive and aligns with our understanding that the sample mean is a good estimator for the population mean.
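
As a practical illustration, the short Python sketch below (a hypothetical example using simulated data with a made-up true mean of 2.5 and the known variance $\sigma^2 = 4$) computes the analytic MLE and confirms it by numerically minimizing the negative log-likelihood with SciPy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated data: 200 draws from a normal with known variance 4 (sigma = 2)
rng = np.random.default_rng(1)
sigma2 = 4.0
x = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=200)

# Analytic MLE: the sample mean
mu_hat_analytic = x.mean()

# Numerical MLE: minimize the negative log-likelihood over mu
def neg_log_likelihood(mu):
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

mu_hat_numeric = minimize_scalar(neg_log_likelihood).x

print(mu_hat_analytic, mu_hat_numeric)  # the two estimates agree
```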

Example 2: Estimating the Success Probability in a Binomial Distribution

Problem Setup

Suppose you are running an experiment where you flip a coin $n$ times and observe $k$ heads. You want to estimate the probability $p$ of getting heads using MLE.

Step 1: Write Down the Likelihood Function

The likelihood function for a binomial distribution, where $k$ successes are observed out of $n$ trials, is given by:

$$L(p; X) = \binom{n}{k} p^k (1-p)^{n-k}$$

Step 2: Simplify Using the Log-Likelihood

Taking the logarithm of the likelihood function to get the log-likelihood:

$$\ell(p; X) = \log L(p; X) = \log \binom{n}{k} + k \log(p) + (n-k) \log(1-p)$$

Again, the first term is constant with respect to $p$, so we can focus on the second and third terms:

$$\ell(p; X) = k \log(p) + (n-k) \log(1-p)$$

Step 3: Maximize the Log-Likelihood

To find the value of $p$ that maximizes the log-likelihood, take the derivative with respect to $p$ and set it to zero:

$$\frac{d\ell(p; X)}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$

Solving for $p$:

$$\frac{k}{p} = \frac{n-k}{1-p}$$

This simplifies to:

$$p = \frac{k}{n}$$

Thus, the MLE for $p$ is the sample proportion:

$$\hat{p} = \frac{k}{n}$$

Step 4: Interpretation

The MLE for the probability of success $p$ in a binomial distribution is the observed proportion of successes. This result makes intuitive sense and is widely used in estimating probabilities from binary data.
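
The result is easy to verify numerically. The sketch below (using made-up counts of $n = 100$ flips and $k = 37$ heads) evaluates the binomial log-likelihood on a grid of candidate values and confirms that the maximum sits at the sample proportion:

```python
import numpy as np

# Hypothetical experiment: 100 coin flips, 37 heads
n, k = 100, 37

# Closed-form MLE: the sample proportion
p_hat = k / n

# Evaluate the log-likelihood (dropping the constant binomial coefficient)
# on a fine grid of candidate values for p
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)
p_grid_max = p_grid[np.argmax(log_lik)]

print(p_hat, p_grid_max)  # both are approximately 0.37
```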

Example 3: Estimating the Rate Parameter of a Poisson Distribution

Problem Setup

Suppose you are observing the number of events that occur in each of $n$ fixed intervals of time, and you believe the number of events in each interval follows a Poisson distribution with an unknown rate parameter $\lambda$. You record the counts $x_1, x_2, \dots, x_n$, one per interval, and want to estimate $\lambda$ using MLE.

Step 1: Write Down the Likelihood Function

The likelihood function for the Poisson distribution is given by:

$$L(\lambda; X) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$$

where $x_i$ is the number of events observed in interval $i$.

Step 2: Simplify Using the Log-Likelihood

Taking the logarithm of the likelihood function to get the log-likelihood:

$$\ell(\lambda; X) = \sum_{i=1}^n \left( x_i \log(\lambda) - \lambda - \log(x_i!) \right)$$

Ignoring the constant term $\sum_{i=1}^n \log(x_i!)$, the log-likelihood simplifies to:

$$\ell(\lambda; X) = \sum_{i=1}^n \left( x_i \log(\lambda) - \lambda \right)$$

Step 3: Maximize the Log-Likelihood

To find the value of $\lambda$ that maximizes the log-likelihood, take the derivative with respect to $\lambda$ and set it to zero:

$$\frac{d\ell(\lambda; X)}{d\lambda} = \sum_{i=1}^n \left( \frac{x_i}{\lambda} - 1 \right) = 0$$

Simplifying this:

$$\sum_{i=1}^n \frac{x_i}{\lambda} = n$$

$$\lambda = \frac{1}{n} \sum_{i=1}^n x_i$$

Thus, the MLE for $\lambda$ is the sample mean of the observed data:

$$\hat{\lambda} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

Step 4: Interpretation

The MLE for the rate parameter $\lambda$ of a Poisson distribution is the average number of events observed per interval. This makes intuitive sense, as the sample mean is a natural estimate of the expected number of events in each interval.
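
A brief simulation sketch (with a made-up true rate of 4.2 events per interval and 50 observed intervals) shows the estimator in action:

```python
import numpy as np

# Simulated event counts for 50 intervals from a Poisson process
rng = np.random.default_rng(7)
lambda_true = 4.2
counts = rng.poisson(lam=lambda_true, size=50)

# MLE for the rate parameter: the sample mean of the counts
lambda_hat = counts.mean()

print(lambda_hat)  # close to 4.2, and closer still as the number of intervals grows
```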

Conclusion

Maximum Likelihood Estimation (MLE) is a powerful and versatile method for estimating the parameters of statistical models. By maximizing the likelihood function, MLE provides parameter estimates that are consistent, asymptotically efficient, and asymptotically normal, making it a cornerstone of statistical inference.

In this article, we explored the principles of MLE, discussed its important properties, and demonstrated how to apply it to estimate parameters in various statistical models, including the normal, binomial, and Poisson distributions. Through detailed examples, we showed how MLE works in practice and how it can be used to derive meaningful parameter estimates from data.

Understanding and applying MLE is essential for data scientists and statisticians, as it forms the basis for many advanced techniques in machine learning, econometrics, and beyond. By mastering MLE, you can enhance your ability to build and interpret statistical models, leading to better insights and more informed decisions based on data.