Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a fundamental method in statistical inference used to estimate the parameters of a statistical model. It is widely used in data science and machine learning because of its theoretical properties and practical applicability. This article dives deep into the principles of MLE, explores its properties, and provides detailed examples of how to apply it to estimate parameters in various statistical models.

Understanding Maximum Likelihood Estimation (MLE)

What is MLE?

Maximum Likelihood Estimation is a method for estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood function. The likelihood function measures how plausible the observed data is given a set of parameter values.

Mathematically, if we have a statistical model with a parameter $\theta$ and a set of observed data $X = \{x_1, x_2, \dots, x_n\}$, the likelihood function is defined as:

$$L(\theta; X) = P(X \mid \theta)$$

The goal of MLE is to find the parameter value $\hat{\theta}$ that maximizes the likelihood function:

$$\hat{\theta} = \underset{\theta}{\text{argmax}} \, L(\theta; X)$$

Assumptions:

  • Independence and Identical Distribution (i.i.d.): The data points $x_1, x_2, \dots, x_n$ are assumed to be independent and identically distributed. This allows the likelihood function to factor into the product of individual probabilities.
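
Written out, the i.i.d. assumption lets the joint likelihood factor into a product of per-observation terms:

$$L(\theta; X) = P(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta)$$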

Likelihood vs. Probability

It’s important to distinguish between likelihood and probability:

  • Probability ($P(X \mid \theta)$): The probability of observing the data $X$ given specific parameter values $\theta$. It is used to predict future data based on the model.

  • Likelihood ($L(\theta \mid X)$): A function of the parameters $\theta$ given the observed data $X$. It represents how plausible the parameters are in explaining the observed data.

While probability is used to predict future data given a model, likelihood is used to infer model parameters from observed data.

Log-Likelihood

To simplify the maximization process, especially when dealing with products of probabilities, it’s common to work with the log-likelihood function. The log-likelihood is the natural logarithm of the likelihood function:

$$\ell(\theta; X) = \log L(\theta; X) = \sum_{i=1}^n \log P(x_i \mid \theta)$$

Because the logarithm is a strictly increasing function, maximizing the log-likelihood $\ell(\theta; X)$ gives the same result as maximizing the likelihood $L(\theta; X)$, but with the advantage of turning products into sums, making the calculations easier and numerically more stable.
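
To see the numerical-stability point concretely, here is a minimal Python sketch (the data and parameter values are made up for illustration) that evaluates a normal likelihood both as a raw product and as a sum of logs:

```python
import numpy as np

# Illustrative data: 1,000 draws from a standard normal distribution
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Raw likelihood: a product of 1,000 numbers well below 1 underflows to 0.0
likelihood = np.prod(normal_pdf(x, mu=0.0, sigma=1.0))

# Log-likelihood: the sum of logs stays comfortably within floating-point range
log_likelihood = np.sum(np.log(normal_pdf(x, mu=0.0, sigma=1.0)))

print(likelihood)       # 0.0 due to underflow
print(log_likelihood)   # a finite value around -1400
```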

Properties of Maximum Likelihood Estimators

MLE has several desirable properties that make it a powerful method in statistical inference:

  1. Consistency:

    • As the sample size increases, the MLE converges to the true parameter value. This means that with enough data, the MLE will give an accurate estimate of the parameter.
  2. Asymptotic Normality:

    • For large samples, the distribution of the MLE is approximately normal (Gaussian) with a mean equal to the true parameter value and a variance equal to the inverse of the Fisher information. This allows for the construction of confidence intervals and hypothesis tests; the simulation sketch after this list illustrates both consistency and this approximate normality.
  3. Efficiency:

    • Asymptotically, the MLE attains the Cramér-Rao lower bound (the smallest variance achievable by any unbiased estimator), making it the most efficient estimator under regularity conditions.
  4. Invariance:

    • If $\hat{\theta}$ is the MLE for $\theta$, and $\phi = g(\theta)$ is a function of $\theta$, then the MLE for $\phi$ is $g(\hat{\theta})$. This property makes MLEs easy to work with when transforming parameters.

    Invariance Example:

    Suppose $\hat{\theta}$ is the MLE for $\theta$, and you are interested in estimating $\phi = \theta^2$. According to the invariance property, the MLE for $\phi$ is $\hat{\phi} = \hat{\theta}^2$.

  5. Regularity Conditions:

    • These properties hold under certain regularity conditions, such as the existence of the first and second derivatives of the log-likelihood function and the parameter space being open. These conditions ensure that the mathematical derivations leading to the properties are valid.
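
To make consistency and asymptotic normality concrete, here is a small simulation sketch in Python, assuming a simple Bernoulli model with a made-up true value of $p = 0.3$:

```python
import numpy as np

# Repeatedly run an n-flip Bernoulli experiment and compute the MLE
# p_hat = sample proportion for each repetition.
rng = np.random.default_rng(42)
p_true = 0.3

for n in (10, 100, 10_000):
    samples = rng.binomial(n=1, p=p_true, size=(5_000, n))  # 5,000 repetitions
    p_hat = samples.mean(axis=1)                             # MLE per repetition
    print(f"n={n:>6}: mean of p_hat = {p_hat.mean():.4f}, std = {p_hat.std():.4f}")

# Consistency: as n grows, p_hat concentrates around p_true = 0.3.
# Asymptotic normality: the spread shrinks roughly like
# sqrt(p_true * (1 - p_true) / n), i.e. the inverse Fisher information
# scaled by the sample size, and a histogram of p_hat looks increasingly Gaussian.
```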

Applying MLE to Estimate Parameters

Let’s explore how to apply MLE to estimate parameters in various statistical models through detailed examples.

Example 1: Estimating the Mean of a Normal Distribution

Problem Setup

Suppose you have a set of data points that you believe are drawn from a normal distribution with an unknown mean $\mu$ and a known variance $\sigma^2 = 4$. Your goal is to estimate the mean $\mu$ using MLE.

Step 1: Write Down the Likelihood Function

The likelihood function for the normal distribution is given by:

$$L(\mu; X) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Given that $\sigma^2$ is known and constant, the factor $\frac{1}{\sqrt{2\pi\sigma^2}}$ does not affect where the maximum occurs, so we can focus on the part of the likelihood that depends on $\mu$:

$$L(\mu; X) \propto \prod_{i=1}^n \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Step 2: Simplify Using the Log-Likelihood

Taking the logarithm of the likelihood function to get the log-likelihood:

$$\ell(\mu; X) = \log L(\mu; X) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

The first term is constant with respect to $\mu$, so we can ignore it when maximizing. Focus on the second term:

$$\ell(\mu; X) = -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

Step 3: Maximize the Log-Likelihood

To find the value of $\mu$ that maximizes the log-likelihood, take the derivative with respect to $\mu$ and set it to zero:

$$\frac{d\ell(\mu; X)}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0$$

This simplifies to:

$$\sum_{i=1}^n (x_i - \mu) = 0$$

which further simplifies to:

$$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$

Thus, the MLE for $\mu$ is the sample mean:

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$
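
As a quick sanity check, the second derivative of the log-likelihood is negative, so this stationary point is indeed a maximum rather than a minimum:

$$\frac{d^2\ell(\mu; X)}{d\mu^2} = -\frac{n}{\sigma^2} < 0$$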

Step 4: Interpretation

The MLE for the mean $\mu$ of a normal distribution with known variance is simply the arithmetic mean of the observed data. This result is intuitive and aligns with our understanding that the sample mean is a good estimator for the population mean.
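
As a practical illustration, the short Python sketch below (a hypothetical example using simulated data with a made-up true mean of 2.5 and the known variance $\sigma^2 = 4$) computes the analytic MLE and confirms it by numerically minimizing the negative log-likelihood with SciPy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated data: 200 draws from a normal with known variance 4 (sigma = 2)
rng = np.random.default_rng(1)
sigma2 = 4.0
x = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=200)

# Analytic MLE: the sample mean
mu_hat_analytic = x.mean()

# Numerical MLE: minimize the negative log-likelihood over mu
def neg_log_likelihood(mu):
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

mu_hat_numeric = minimize_scalar(neg_log_likelihood).x

print(mu_hat_analytic, mu_hat_numeric)  # the two estimates agree
```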

Example 2: Estimating the Success Probability in a Binomial Distribution

Problem Setup

Suppose you are running an experiment where you flip a coin $n$ times and observe $k$ heads. You want to estimate the probability $p$ of getting heads using MLE.

Step 1: Write Down the Likelihood Function

The likelihood function for a binomial distribution, where $k$ successes are observed out of $n$ trials, is given by:

$$L(p; X) = \binom{n}{k} p^k (1-p)^{n-k}$$

Step 2: Simplify Using the Log-Likelihood

Taking the logarithm of the likelihood function to get the log-likelihood:

$$\ell(p; X) = \log L(p; X) = \log \binom{n}{k} + k \log(p) + (n-k) \log(1-p)$$

Again, the first term is constant with respect to $p$, so we can focus on the second and third terms:

$$\ell(p; X) = k \log(p) + (n-k) \log(1-p)$$

Step 3: Maximize the Log-Likelihood

To find the value of $p$ that maximizes the log-likelihood, take the derivative with respect to $p$ and set it to zero:

$$\frac{d\ell(p; X)}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$

Solving for $p$:

$$\frac{k}{p} = \frac{n-k}{1-p}$$

This simplifies to:

$$p = \frac{k}{n}$$

Thus, the MLE for $p$ is the sample proportion:

$$\hat{p} = \frac{k}{n}$$

Step 4: Interpretation

The MLE for the probability of success $p$ in a binomial distribution is the observed proportion of successes. This result makes intuitive sense and is widely used in estimating probabilities from binary data.
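
The result is easy to verify numerically. The sketch below (using made-up counts of $n = 100$ flips and $k = 37$ heads) evaluates the binomial log-likelihood on a grid of candidate values and confirms that the maximum sits at the sample proportion:

```python
import numpy as np

# Hypothetical experiment: 100 coin flips, 37 heads
n, k = 100, 37

# Closed-form MLE: the sample proportion
p_hat = k / n

# Evaluate the log-likelihood (dropping the constant binomial coefficient)
# on a fine grid of candidate values for p
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)
p_grid_max = p_grid[np.argmax(log_lik)]

print(p_hat, p_grid_max)  # both are approximately 0.37
```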

Example 3: Estimating the Rate Parameter of a Poisson Distribution

Problem Setup

Suppose you are observing the number of events that occur in each of $n$ fixed intervals of time, and you believe the number of events in each interval follows a Poisson distribution with an unknown rate parameter $\lambda$. You record the counts $x_1, x_2, \dots, x_n$, one per interval, and want to estimate $\lambda$ using MLE.

Step 1: Write Down the Likelihood Function

The likelihood function for the Poisson distribution is given by:

$$L(\lambda; X) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$$

where $x_i$ is the number of events observed in interval $i$.

Step 2: Simplify Using the Log-Likelihood

Taking the logarithm of the likelihood function to get the log-likelihood:

$$\ell(\lambda; X) = \sum_{i=1}^n \left( x_i \log(\lambda) - \lambda - \log(x_i!) \right)$$

Ignoring the constant term $\sum_{i=1}^n \log(x_i!)$, the log-likelihood simplifies to:

$$\ell(\lambda; X) = \sum_{i=1}^n \left( x_i \log(\lambda) - \lambda \right)$$

Step 3: Maximize the Log-Likelihood

To find the value of $\lambda$ that maximizes the log-likelihood, take the derivative with respect to $\lambda$ and set it to zero:

$$\frac{d\ell(\lambda; X)}{d\lambda} = \sum_{i=1}^n \left( \frac{x_i}{\lambda} - 1 \right) = 0$$

Simplifying this:

$$\sum_{i=1}^n \frac{x_i}{\lambda} = n$$

$$\lambda = \frac{1}{n} \sum_{i=1}^n x_i$$

Thus, the MLE for $\lambda$ is the sample mean of the observed data:

$$\hat{\lambda} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

Step 4: Interpretation

The MLE for the rate parameter $\lambda$ of a Poisson distribution is the average number of events observed per interval. This makes intuitive sense, as the sample mean is a natural estimate of the expected number of events in each interval.
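
A brief simulation sketch (with a made-up true rate of 4.2 events per interval and 50 observed intervals) shows the estimator in action:

```python
import numpy as np

# Simulated event counts for 50 intervals from a Poisson process
rng = np.random.default_rng(7)
lambda_true = 4.2
counts = rng.poisson(lam=lambda_true, size=50)

# MLE for the rate parameter: the sample mean of the counts
lambda_hat = counts.mean()

print(lambda_hat)  # close to 4.2, and closer still as the number of intervals grows
```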

Conclusion

Maximum Likelihood Estimation (MLE) is a powerful and versatile method for estimating the parameters of statistical models. By maximizing the likelihood function, MLE provides parameter estimates that are consistent, asymptotically efficient, and asymptotically normal, making it a cornerstone of statistical inference.

In this article, we explored the principles of MLE, discussed its important properties, and demonstrated how to apply it to estimate parameters in various statistical models, including the normal, binomial, and Poisson distributions. Through detailed examples, we showed how MLE works in practice and how it can be used to derive meaningful parameter estimates from data.

Understanding and applying MLE is essential for data scientists and statisticians, as it forms the basis for many advanced techniques in machine learning, econometrics, and beyond. By mastering MLE, you can enhance your ability to build and interpret statistical models, leading to better insights and more informed decisions based on data.