Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is one of the most fundamental methods for fitting a parametric model to observed data. The core idea is simple: given a dataset, find the parameters that make the observed data most probable under the model.


Setup

Suppose we observe a dataset $\mathcal{D} = \{x_1, \dots, x_n\}$ drawn i.i.d. (independent and identically distributed) from an unknown distribution $p_{\text{data}}(x)$. We posit a parametric family of joint distributions $p_\theta(x)$ (note that $x$ is in general a vector of multiple features) and want to find the $\theta$ that best explains the data - assigning high probability to frequently occurring observations and low probability to rare ones.

Since the data was i.i.d. sampled, the joint likelihood factorizes into a product. We can write the likelihood of the dataset under the model as:

$$L(\theta) = p_\theta(x_1, \dots, x_n) = \prod_{i=1}^{n} p_\theta(x_i)$$

Thus the MLE objective is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p_\theta(x_i)$$


Log-Likelihood

Products of many probabilities are numerically unstable and hard to differentiate. Taking the logarithm - a monotone transformation - converts the product into a sum without changing the argmax:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p_\theta(x_i)$$

In practice, we minimize the negative log-likelihood (NLL):

$$\text{NLL}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)$$

The factor $\frac{1}{n}$ normalizes the loss so it does not scale with dataset size, making it a Monte Carlo estimate of the population objective:

$$\mathbb{E}_{x \sim p_{\text{data}}} \left[ -\log p_\theta(x) \right]$$

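The numerical point about products is easy to demonstrate. The following sketch (with an arbitrary toy probability of 0.01 per sample) shows the product of many likelihoods underflowing to zero in floating point while the sum of logs stays finite:

```python
import math

# A product of many small probabilities underflows to 0.0 in floating
# point, while the equivalent sum of logs remains a finite number.
probs = [0.01] * 200                          # 200 samples, each with likelihood 0.01
product = math.prod(probs)                    # underflows to 0.0
log_sum = sum(math.log(p) for p in probs)     # = 200 * log(0.01), perfectly finite

print(product, log_sum)
```

Any gradient-based optimizer working on the raw product would see an exactly-zero objective here; the log-domain version remains informative.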

Examples

Example 1: Coin Flipping (Bernoulli MLE)

Suppose we flip a coin $n$ times and observe outcomes $x_1, \dots, x_n \in \{0, 1\}$, where $x_i = 1$ denotes heads and $x_i = 0$ denotes tails. We model each flip as $x_i \sim \text{Bernoulli}(\mu)$, with the single unknown parameter $\mu \in [0, 1]$. The probability of a single outcome is:

$$p_\mu(x) = \mu^x (1 - \mu)^{1 - x}$$

The log-likelihood of a single sample is:

$$\log p_\mu(x_i) = x_i \log \mu + (1 - x_i) \log(1 - \mu)$$

Summing over all flips, let $n_1 = \sum_{i=1}^{n} x_i$ denote the number of heads. The total log-likelihood is:

$$\ell(\mu) = n_1 \log \mu + (n - n_1) \log(1 - \mu)$$

Solving for $\mu$: Taking the derivative with respect to $\mu$ and setting it to zero:

$$\frac{d\ell}{d\mu} = \frac{n_1}{\mu} - \frac{n - n_1}{1 - \mu} = 0 \quad \Longrightarrow \quad \hat{\mu} = \frac{n_1}{n}$$

The MLE estimate is simply the empirical fraction of heads - exactly what intuition suggests.
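The closed-form answer can be checked numerically. This is a minimal sketch with an assumed true heads probability of 0.7 and simulated flips; it computes the empirical fraction and verifies it achieves a lower NLL than nearby candidate values:

```python
import math
import random

random.seed(0)

# Simulate n coin flips with an assumed true heads probability (toy setup).
mu_true, n = 0.7, 10_000
flips = [1 if random.random() < mu_true else 0 for _ in range(n)]

def neg_log_likelihood(mu, xs):
    """Average Bernoulli NLL: -(1/n) * sum(x*log(mu) + (1-x)*log(1-mu))."""
    return -sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in xs) / len(xs)

# Closed-form MLE: the empirical fraction of heads.
mu_hat = sum(flips) / n

# Sanity check: the closed-form estimate beats nearby candidates.
for mu in (mu_hat - 0.05, mu_hat + 0.05):
    assert neg_log_likelihood(mu_hat, flips) < neg_log_likelihood(mu, flips)

print(mu_hat)
```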

Example 2: Gaussian MLE

Suppose we model the data as $x_i \sim \mathcal{N}(\mu, \sigma^2)$, with parameters $\theta = (\mu, \sigma^2)$. The density of a single sample is:

$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

Taking the log, the log-likelihood of a single sample is:

$$\log p_\theta(x_i) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}$$

Summing over the dataset, the total log-likelihood is:

$$\ell(\theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$

Solving for $\mu$: Taking the derivative with respect to $\mu$ and setting it to zero:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \quad \Longrightarrow \quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Solving for $\sigma^2$: Taking the derivative with respect to $\sigma^2$ and setting it to zero:

$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 \quad \Longrightarrow \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

These are simply the sample mean and sample variance - the MLE recovers the intuitive estimators from first principles.
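A quick numerical sketch (with assumed true parameters $\mu = 2$, $\sigma = 1.5$ and simulated data) confirms that the sample mean and the biased sample variance recover the generating parameters:

```python
import math
import random

random.seed(1)

# Simulate data from a Gaussian with assumed true parameters (toy setup).
mu_true, sigma_true, n = 2.0, 1.5, 50_000
xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]

# Closed-form Gaussian MLE: sample mean and (biased, 1/n) sample variance.
mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

print(mu_hat, math.sqrt(var_hat))
```

Note that the MLE variance divides by $n$, not $n - 1$; the estimator is slightly biased but consistent, and the difference vanishes as $n$ grows.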

Example 3: Linear Regression (Gaussian noise MLE)

In linear regression, we observe pairs $(x_i, y_i)$ for $i = 1, \dots, n$ and model the output as:

$$y_i = w^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

This means the conditional distribution of $y_i$ given $x_i$ is:

$$p(y_i \mid x_i; w) = \mathcal{N}(y_i \mid w^\top x_i, \sigma^2)$$

The log-likelihood of a single pair is:

$$\log p(y_i \mid x_i; w) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_i - w^\top x_i)^2}{2\sigma^2}$$

Summing over the dataset and dropping the constant term (which does not depend on $w$), the total log-likelihood is:

$$\ell(w) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2$$

Maximizing over $w$ is equivalent to minimizing the sum of squared residuals:

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} (y_i - w^\top x_i)^2$$

Solving for $w$: In matrix form, let $X \in \mathbb{R}^{n \times d}$ be the design matrix and $y \in \mathbb{R}^n$ the target vector. The objective becomes $\|y - Xw\|^2$. Taking the gradient and setting it to zero:

$$\nabla_w \|y - Xw\|^2 = -2 X^\top (y - Xw) = 0 \quad \Longrightarrow \quad \hat{w} = (X^\top X)^{-1} X^\top y$$

This is the well-known ordinary least squares (OLS) solution. The key insight is that minimizing MSE in linear regression is exactly MLE under a Gaussian noise assumption.
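The normal equations can be sketched in the simplest non-trivial case: one feature plus an intercept, so the design matrix has columns $[x, 1]$ and $X^\top X$ is a $2 \times 2$ system we can solve by hand. The true slope, intercept, and noise level below are assumed toy values:

```python
import random

random.seed(2)

# Simulate y = w_true * x + b_true + Gaussian noise (assumed toy setup).
w_true, b_true, n = 3.0, -1.0, 10_000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [w_true * x + b_true + random.gauss(0, 0.1) for x in xs]

# Normal equations (X^T X) [w, b]^T = X^T y for the design matrix
# with columns [x, 1], solved explicitly as a 2x2 linear system.
sxx = sum(x * x for x in xs)
sx = sum(xs)
sxy = sum(x * y for x, y in zip(xs, ys))
sy = sum(ys)
det = sxx * n - sx * sx
w_hat = (n * sxy - sx * sy) / det
b_hat = (sxx * sy - sx * sxy) / det

print(w_hat, b_hat)
```

For more than a handful of features one would solve the system with a linear-algebra routine (or, better, a QR/least-squares solver) rather than forming the inverse explicitly, since $(X^\top X)^{-1}$ is numerically fragile when $X$ is ill-conditioned.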


Connection to KL Divergence

Minimizing the NLL is equivalent to minimizing the KL divergence between $p_{\text{data}}$ and $p_\theta$:

$$D_{\text{KL}}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \frac{p_{\text{data}}(x)}{p_\theta(x)} \right] = \mathbb{E}_{x \sim p_{\text{data}}} \left[ -\log p_\theta(x) \right] - H(p_{\text{data}})$$

where $H(p_{\text{data}})$ is the entropy of the data distribution.

Since the second term does not depend on $\theta$, minimizing the KL divergence reduces exactly to minimizing the NLL.


Summary

This note has covered MLE and its application to simple parametric distribution families, illustrated through three classical examples: Bernoulli coin flipping, Gaussian estimation, and Gaussian linear regression. Note that in practice, $p_\theta$ rarely admits a closed-form solution like in the examples above - it is often a deep, expressive neural network, in which case the MLE objective must be optimized iteratively via gradient descent.
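To illustrate the iterative route, here is a deliberately simple sketch: fitting the mean of a Gaussian (known $\sigma = 1$) by gradient descent on the average NLL. This particular case has a closed form, which makes it easy to check that the iterates converge to the right answer; the true mean, learning rate, and step count are assumed toy values:

```python
import random

random.seed(3)

# Toy data from a Gaussian with assumed true mean 5.0 and sigma 1.0.
xs = [random.gauss(5.0, 1.0) for _ in range(1_000)]

# Gradient descent on the average NLL. For a Gaussian with sigma = 1,
# the gradient w.r.t. mu is -(1/n) * sum(x - mu) = mu - mean(x).
mu, lr = 0.0, 0.1
for _ in range(200):
    grad = mu - sum(xs) / len(xs)
    mu -= lr * grad

print(mu)
```

The iterates converge geometrically to the sample mean, i.e. to the closed-form MLE; with a neural-network $p_\theta$ the same loop applies, only the gradient comes from backpropagation instead of a hand-derived formula.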