Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is one of the most fundamental methods for fitting a parametric model to observed data. The core idea is simple: given a dataset, find the parameters that make the observed data most probable under the model.

Setup

Suppose we observe a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ drawn i.i.d. from an unknown distribution $p_{\text{data}}(x)$. We posit a parametric family of distributions $p_\theta(x)$ and want to find the $\theta$ that best explains the data — assigning high probability to frequently occurring data and low probability to rare observations.

Since the data are i.i.d., the joint likelihood factorizes into a product, so the likelihood of the dataset under the model is:

$$
L(\theta) = \prod_{i=1}^{N} p_\theta(x_i)
$$

Thus the MLE objective is:

$$
\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{N} p_\theta(x_i)
$$
Log-Likelihood

Products are numerically unstable and hard to differentiate. Taking the logarithm — a monotone transformation — converts the product into a sum without changing the argmax:

$$
\hat{\theta}_{\text{MLE}} = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x_i)
$$
In practice, we minimize the negative log-likelihood (NLL):

$$
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)
$$

The factor $\frac{1}{N}$ normalizes the loss so it does not scale with dataset size, making it a Monte Carlo estimate of the population objective:

$$
\mathcal{L}(\theta) \approx -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right]
$$
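This averaged NLL is straightforward to compute directly. Below is a minimal sketch (the helper names `avg_nll` and `log_normal` are illustrative, not from any library), using a standard normal $\mathcal{N}(0, 1)$ as the model:

```python
import math

def avg_nll(data, log_pdf):
    """Average negative log-likelihood: a Monte Carlo estimate of
    -E_{x ~ p_data}[log p_theta(x)]."""
    return -sum(log_pdf(x) for x in data) / len(data)

def log_normal(x, mu=0.0, sigma2=1.0):
    """Log-density of N(mu, sigma2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

data = [0.5, -1.2, 0.3, 2.0]
loss = avg_nll(data, log_normal)  # about 1.641 for this toy dataset
```

In a deep-learning framework the same quantity appears as the mean of per-sample log-density terms, with gradients taken through `log_pdf`.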
Example: Gaussian MLE

Suppose we model the data as $x_i \sim \mathcal{N}(\mu, \sigma^2)$, with parameters $\theta = (\mu, \sigma^2)$. The density of a single sample is:

$$
p_\theta(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
$$

Taking the log, the log-likelihood of a single sample is:

$$
\log p_\theta(x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2}
$$

Summing over the dataset, the total log-likelihood is:

$$
\log L(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2
$$
Solving for $\mu$: Taking the derivative with respect to $\mu$ and setting it to zero:

$$
\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i
$$

Solving for $\sigma^2$: Taking the derivative with respect to $\sigma^2$ and setting it to zero:

$$
\frac{\partial \log L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0 \quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2
$$
These are simply the sample mean and sample variance — the MLE recovers the intuitive estimators from first principles.
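As a sanity check, the closed-form estimators above can be computed in a few lines (a sketch; the function name `gaussian_mle` is my own):

```python
def gaussian_mle(xs):
    """Closed-form Gaussian MLE: the sample mean and the
    1/N-normalized (biased) sample variance derived above."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

xs = [1.0, 2.0, 3.0, 4.0]
mu_hat, sigma2_hat = gaussian_mle(xs)  # mu_hat = 2.5, sigma2_hat = 1.25
```

Note the $1/N$ normalization: the MLE variance is the biased estimator, not the $1/(N-1)$ unbiased one.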

Example: Coin Flipping (Bernoulli MLE)

Suppose we flip a coin $N$ times and observe outcomes $x_1, \dots, x_N \in \{0, 1\}$, where $x_i = 1$ is heads and $x_i = 0$ is tails. We model each flip as $x_i \sim \text{Bernoulli}(p)$, with the single unknown parameter $\theta = p$. The probability of a single outcome is:

$$
p_\theta(x_i) = p^{x_i}(1 - p)^{1 - x_i}
$$
The log-likelihood of a single sample is:

$$
\log p_\theta(x_i) = x_i \log p + (1 - x_i)\log(1 - p)
$$
Summing over all flips, let $k = \sum_{i=1}^{N} x_i$ denote the number of heads. The total log-likelihood is:

$$
\log L(p) = k \log p + (N - k)\log(1 - p)
$$
Solving for $p$: Taking the derivative with respect to $p$ and setting it to zero:

$$
\frac{d \log L}{dp} = \frac{k}{p} - \frac{N - k}{1 - p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{k}{N}
$$
The MLE estimate is simply the empirical fraction of heads — exactly what intuition suggests.
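The estimator and the NLL it minimizes are easy to check numerically. In this sketch (helper names `bernoulli_mle` and `bernoulli_nll` are my own), the closed-form $\hat{p}$ achieves a lower NLL than nearby values of $p$:

```python
import math

def bernoulli_mle(flips):
    """MLE for the Bernoulli parameter: the fraction of heads."""
    return sum(flips) / len(flips)

def bernoulli_nll(p, flips):
    """Negative log-likelihood k*log(p) + (N-k)*log(1-p), negated."""
    k = sum(flips)
    n = len(flips)
    return -(k * math.log(p) + (n - k) * math.log(1 - p))

flips = [1, 0, 1, 1, 0, 1, 1, 0]  # 5 heads out of 8
p_hat = bernoulli_mle(flips)      # 0.625

# Sanity check: p_hat beats nearby values of p on the NLL.
assert bernoulli_nll(p_hat, flips) < bernoulli_nll(0.5, flips)
assert bernoulli_nll(p_hat, flips) < bernoulli_nll(0.75, flips)
```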

Example: Linear Regression (Gaussian noise MLE)

In linear regression, we observe pairs $(x_i, y_i)$ and model the output as:

$$
y_i = w^\top x_i + b + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
$$
This means the conditional distribution of $y_i$ given $x_i$ is:

$$
p_\theta(y_i \mid x_i) = \mathcal{N}\left(y_i \mid w^\top x_i + b,\; \sigma^2\right)
$$
The log-likelihood of a single pair is:

$$
\log p_\theta(y_i \mid x_i) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - w^\top x_i - b)^2}{2\sigma^2}
$$
Summing over the dataset and dropping the constant term (which does not depend on $w$ or $b$), the total log-likelihood is:

$$
\log L(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - w^\top x_i - b)^2 + \text{const}
$$
Maximizing over $(w, b)$ is equivalent to minimizing the sum of squared residuals:

$$
\hat{w}, \hat{b} = \arg\min_{w, b} \sum_{i=1}^{N}(y_i - w^\top x_i - b)^2
$$
Solving for $w$: In matrix form, let $X \in \mathbb{R}^{N \times (d+1)}$ be the design matrix (with a column of ones to absorb the intercept $b$) and $y \in \mathbb{R}^N$ the target vector. The objective becomes $\|y - Xw\|^2$. Taking the gradient and setting it to zero:

$$
\nabla_w \|y - Xw\|^2 = -2X^\top(y - Xw) = 0 \quad\Longrightarrow\quad \hat{w} = (X^\top X)^{-1} X^\top y
$$
This is the well-known ordinary least squares (OLS) solution. The key insight is that minimizing MSE in linear regression is exactly MLE under a Gaussian noise assumption.
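A quick numerical check of the normal equations, assuming NumPy is available (the synthetic data and variable names are illustrative). Solving the normal equations recovers the same coefficients as NumPy's least-squares routine, and both land close to the true weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
# Design matrix with a column of ones to absorb the intercept.
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
w_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=N)  # Gaussian noise, sigma = 0.1

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Matches the library least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice one uses `np.linalg.solve` or `lstsq` instead of explicitly computing $(X^\top X)^{-1}$, which is slower and less numerically stable.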


Connection to KL Divergence

Minimizing the NLL is equivalent to minimizing the KL divergence between $p_{\text{data}}$ and $p_\theta$:

$$
D_{\text{KL}}\left(p_{\text{data}} \,\|\, p_\theta\right) = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right] + \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_{\text{data}}(x)\right]
$$
Since the second term (the negative entropy of the data distribution) does not depend on $\theta$, minimizing $D_{\text{KL}}$ reduces exactly to minimizing the NLL.
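This identity is easy to verify numerically for a discrete distribution. In this sketch (a hypothetical three-outcome categorical $p_{\text{data}}$), the KL divergence and the cross-entropy (the population NLL) differ only by the constant entropy term, for every candidate model $q$:

```python
import math

p_data = [0.5, 0.3, 0.2]  # a hypothetical three-outcome distribution

def cross_entropy(p, q):
    """-E_{x~p}[log q(x)] for discrete distributions."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl(p, q):
    """D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

entropy = cross_entropy(p_data, p_data)  # H(p_data), constant in the model
for q in ([0.4, 0.4, 0.2], [0.6, 0.2, 0.2], [0.5, 0.3, 0.2]):
    # KL and cross-entropy differ only by the constant entropy term.
    assert abs(kl(p_data, q) - (cross_entropy(p_data, q) - entropy)) < 1e-12
```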

Summary

This blog has covered MLE and its application to simple parametric distribution families, illustrated through three classical examples: a Gaussian fit, Bernoulli coin flipping, and linear regression with Gaussian noise. Note that in practice, $p_\theta$ rarely admits a closed-form MLE like the examples above — it is often a deep, expressive neural network, in which case the MLE objective must be optimized iteratively via gradient descent.