Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is one of the most fundamental methods for fitting a parametric model to observed data. The core idea is simple: given a dataset, find the parameters that make the observed data most probable under the model.
Setup
Suppose we observe a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ drawn i.i.d. from an unknown distribution $p_{\text{data}}(x)$. We posit a parametric family of distributions $p_\theta(x)$ and want to find the $\theta$ that best explains the data, assigning high probability to frequently occurring observations and low probability to rare ones.
Since the data are i.i.d., the joint likelihood factorizes into a product of per-sample densities, so the likelihood of the dataset under the model is:

$$L(\theta) = \prod_{i=1}^{N} p_\theta(x_i)$$
Thus the MLE objective is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{N} p_\theta(x_i)$$
Log-Likelihood
Products of many small probabilities are numerically unstable and awkward to differentiate. Taking the logarithm, a monotone transformation, converts the product into a sum without changing the argmax:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i)$$
In practice, we minimize the negative log-likelihood (NLL):

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
The factor $\frac{1}{N}$ normalizes the loss so it does not scale with dataset size, making it a Monte Carlo estimate of the population objective:

$$\mathcal{L}(\theta) \approx \mathbb{E}_{x \sim p_{\text{data}}}\left[-\log p_\theta(x)\right]$$
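To make the normalization concrete, here is a minimal pure-Python sketch (the helper name `gaussian_nll` and the illustrative parameters are assumptions, not from the derivation above) that evaluates the average NLL under a fixed Gaussian model on two datasets of very different sizes:

```python
import math
import random

def gaussian_nll(data, mu, sigma2):
    # Average NLL of `data` under N(mu, sigma2); the 1/N factor keeps the
    # loss on the same scale regardless of dataset size.
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / (2 * sigma2)
        for x in data
    ) / len(data)

random.seed(0)
small = [random.gauss(1.0, 2.0) for _ in range(100)]
large = [random.gauss(1.0, 2.0) for _ in range(10_000)]

# Both values hover around the population quantity
# 0.5 * log(2*pi*4) + 0.5 ~= 2.11, despite the 100x size difference.
print(gaussian_nll(small, 1.0, 4.0))
print(gaussian_nll(large, 1.0, 4.0))
```

Without the $\frac{1}{N}$ factor, the second value would be roughly 100 times the first, and the loss would no longer estimate $\mathbb{E}[-\log p_\theta(x)]$.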
Example: Gaussian MLE
Suppose we model the data as $x_i \sim \mathcal{N}(\mu, \sigma^2)$, with parameters $\theta = (\mu, \sigma^2)$. The density of a single sample is:

$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Taking the log, the log-likelihood of a single sample is:

$$\log p_\theta(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$$
Summing over the dataset, the total log-likelihood is:

$$\log L(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$
Solving for $\mu$: taking the derivative with respect to $\mu$ and setting it to zero:

$$\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Solving for $\sigma^2$: taking the derivative with respect to $\sigma^2$ and setting it to zero:

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0 \quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$$
These are simply the sample mean and sample variance — the MLE recovers the intuitive estimators from first principles.
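A quick numerical sanity check of these closed forms (a pure-Python sketch; the true parameters $\mu = 3.0$, $\sigma = 1.5$ are assumed for illustration):

```python
import random

random.seed(0)
# Synthetic data with assumed true parameters mu=3.0, sigma^2=2.25.
data = [random.gauss(3.0, 1.5) for _ in range(50_000)]

n = len(data)
mu_hat = sum(data) / n                              # MLE: sample mean
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE: sample variance (1/N, not 1/(N-1))

print(mu_hat, var_hat)  # should land near 3.0 and 2.25
```

Note the $\frac{1}{N}$ in the variance estimator: the MLE gives the biased sample variance, not the Bessel-corrected $\frac{1}{N-1}$ version.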
Example: Coin Flipping (Bernoulli MLE)
Suppose we flip a coin $N$ times and observe outcomes $x_1, \dots, x_N \in \{0, 1\}$, where $x_i = 1$ denotes heads and $x_i = 0$ denotes tails. We model each flip as $x_i \sim \text{Bernoulli}(p)$, with the single unknown parameter $p$. The probability of a single outcome is:

$$\Pr(x \mid p) = p^{x}(1 - p)^{1 - x}$$
The log-likelihood of a single sample is:

$$\log \Pr(x \mid p) = x \log p + (1 - x)\log(1 - p)$$
Summing over all flips and letting $k = \sum_{i=1}^{N} x_i$ denote the number of heads, the total log-likelihood is:

$$\log L(p) = k \log p + (N - k)\log(1 - p)$$
Solving for $p$: taking the derivative with respect to $p$ and setting it to zero:

$$\frac{d \log L}{d p} = \frac{k}{p} - \frac{N - k}{1 - p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{k}{N}$$
The MLE estimate is simply the empirical fraction of heads — exactly what intuition suggests.
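This can be verified numerically with a pure-Python sketch (the true bias 0.7 is an assumption for illustration): the closed form $\hat{p} = k/N$ should also win a brute-force grid search over the log-likelihood.

```python
import math
import random

random.seed(0)
true_p = 0.7  # assumed true coin bias, for illustration only
flips = [1 if random.random() < true_p else 0 for _ in range(10_000)]

N = len(flips)
k = sum(flips)   # number of heads
p_hat = k / N    # closed-form MLE

# Sanity check: maximize k*log(p) + (N-k)*log(1-p) over a grid of p values.
def log_lik(p):
    return k * math.log(p) + (N - k) * math.log(1 - p)

best = max((i / 1000 for i in range(1, 1000)), key=log_lik)
print(p_hat, best)  # both close to 0.7
```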
Example: Linear Regression (Gaussian noise MLE)
In linear regression, we observe pairs $(x_i, y_i)$, $i = 1, \dots, N$, and model the output as:

$$y_i = w^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
This means the conditional distribution of $y$ given $x$ is:

$$p(y \mid x; w) = \mathcal{N}(y \mid w^\top x, \sigma^2)$$
The log-likelihood of a single pair is:

$$\log p(y_i \mid x_i; w) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - w^\top x_i)^2}{2\sigma^2}$$
Summing over the dataset and dropping the constant term (which does not depend on $w$), the total log-likelihood is:

$$\log L(w) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - w^\top x_i)^2 + \text{const}$$
Maximizing over $w$ is equivalent to minimizing the sum of squared residuals:

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{N}(y_i - w^\top x_i)^2$$
Solving for $w$: in matrix form, let $X \in \mathbb{R}^{N \times d}$ be the design matrix (with rows $x_i^\top$) and $y \in \mathbb{R}^{N}$ the target vector. The objective becomes $\|y - Xw\|^2$. Taking the gradient and setting it to zero:

$$\nabla_w \|y - Xw\|^2 = -2X^\top(y - Xw) = 0 \quad\Longrightarrow\quad \hat{w} = (X^\top X)^{-1}X^\top y$$
This is the well-known ordinary least squares (OLS) solution. The key insight is that minimizing MSE in linear regression is exactly MLE under a Gaussian noise assumption.
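Here is a minimal pure-Python sketch of the normal equations for the simplest case, a one-dimensional model with an intercept, $y = wx + b$ (the synthetic true slope 2.0 and intercept -1.0 are assumptions for illustration):

```python
import random

random.seed(0)
true_w, true_b = 2.0, -1.0  # assumed ground truth, for illustration
xs = [random.uniform(-3, 3) for _ in range(5_000)]
ys = [true_w * x + true_b + random.gauss(0, 0.5) for x in xs]

# Normal equations X^T X w = X^T y, where each row of X is [x_i, 1]:
#   [sum(x^2)  sum(x)] [w]   [sum(x*y)]
#   [sum(x)    n     ] [b] = [sum(y)  ]
n = len(xs)
sxx = sum(x * x for x in xs)
sx = sum(xs)
sxy = sum(x * y for x, y in zip(xs, ys))
sy = sum(ys)

# Solve the 2x2 system by Cramer's rule.
det = sxx * n - sx * sx
w_hat = (n * sxy - sx * sy) / det
b_hat = (sxx * sy - sx * sxy) / det

print(w_hat, b_hat)  # should land near 2.0 and -1.0
```

For general $d$-dimensional inputs one would solve the same system with a linear-algebra library rather than by hand; the closed form is identical.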
Connection to KL Divergence
Minimizing the NLL is equivalent to minimizing the KL divergence between the data distribution $p_{\text{data}}$ and the model $p_\theta$:

$$D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[-\log p_\theta(x)\right] - \mathbb{E}_{x \sim p_{\text{data}}}\left[-\log p_{\text{data}}(x)\right]$$
Since the second term (the entropy of $p_{\text{data}}$) does not depend on $\theta$, minimizing $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$ reduces exactly to minimizing the expected NLL.
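A small numerical illustration of this equivalence in the Bernoulli setting (the data distribution $p_{\text{data}} = \mathrm{Bernoulli}(0.7)$ is an assumption for illustration): the KL divergence and the expected NLL differ by a constant, so they share the same minimizer.

```python
import math

p_data = 0.7  # assumed data distribution, for illustration

def kl(p):
    # D_KL(Bernoulli(p_data) || Bernoulli(p))
    return p_data * math.log(p_data / p) + (1 - p_data) * math.log((1 - p_data) / (1 - p))

def expected_nll(p):
    # E_{x ~ p_data}[-log p_theta(x)]
    return -(p_data * math.log(p) + (1 - p_data) * math.log(1 - p))

grid = [i / 1000 for i in range(1, 1000)]
# The two objectives differ only by the entropy of p_data, a constant in
# theta, so their minimizers over the grid coincide.
assert min(grid, key=kl) == min(grid, key=expected_nll)
print(min(grid, key=kl))  # 0.7
```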
Summary
This blog has covered MLE and its application to simple parametric distribution families, illustrated through three classical examples: Gaussian parameter estimation, Bernoulli coin flipping, and linear regression with Gaussian noise. Note that in practice the MLE rarely admits a closed-form solution like in the examples above: $p_\theta$ is often a deep, expressive neural network, in which case the NLL objective must be minimized iteratively via gradient descent.