Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is one of the most fundamental methods for fitting a parametric model to observed data. The core idea is simple: given a dataset, find the parameters that make the observed data most probable under the model.
Setup
Suppose we observe a dataset $\mathcal{D} = \{x_1, \dots, x_N\}$ drawn i.i.d. from an unknown distribution $p_{\text{data}}(x)$. We posit a parametric family of distributions $p_\theta(x)$ and want to find the $\theta$ that best explains the data, assigning high probability to frequently occurring observations and low probability to rare ones.
Since the data are i.i.d., the joint likelihood factorizes into a product of per-sample densities, so the likelihood of the dataset under the model is:

$$L(\theta) = \prod_{i=1}^{N} p_\theta(x_i)$$
Thus the MLE objective is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{N} p_\theta(x_i)$$
Log-Likelihood
Products of many small probabilities are numerically unstable and awkward to differentiate. Taking the logarithm, a monotone transformation, converts the product into a sum without changing the argmax:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i)$$
In practice, we minimize the negative log-likelihood (NLL):

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
The factor $\frac{1}{N}$ normalizes the loss so it does not scale with dataset size, making it a Monte Carlo estimate of the population objective:

$$\mathcal{L}(\theta) \approx \mathbb{E}_{x \sim p_{\text{data}}}\left[-\log p_\theta(x)\right]$$
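To make the normalization concrete, here is a minimal pure-Python sketch (the helper name `gaussian_nll` and the illustrative parameters are assumptions, not from the derivation above) that evaluates the average NLL under a fixed Gaussian model on two datasets of very different sizes:

```python
import math
import random

def gaussian_nll(data, mu, sigma2):
    # Average NLL of `data` under N(mu, sigma2); the 1/N factor keeps the
    # loss on the same scale regardless of dataset size.
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / (2 * sigma2)
        for x in data
    ) / len(data)

random.seed(0)
small = [random.gauss(1.0, 2.0) for _ in range(100)]
large = [random.gauss(1.0, 2.0) for _ in range(10_000)]

# Both values hover around the population quantity
# 0.5 * log(2*pi*4) + 0.5 ~= 2.11, despite the 100x size difference.
print(gaussian_nll(small, 1.0, 4.0))
print(gaussian_nll(large, 1.0, 4.0))
```

Without the $\frac{1}{N}$ factor, the second value would be roughly 100 times the first, and the loss would no longer estimate $\mathbb{E}[-\log p_\theta(x)]$.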
Example: Gaussian MLE
Suppose we model the data as $x_i \sim \mathcal{N}(\mu, \sigma^2)$, with parameters $\theta = (\mu, \sigma^2)$. The density of a single sample is:

$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Taking the log, the log-likelihood of a single sample is:

$$\log p_\theta(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$$
Summing over the dataset, the total log-likelihood is:

$$\log L(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$
Solving for $\mu$: taking the derivative with respect to $\mu$ and setting it to zero:

$$\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Solving for $\sigma^2$: taking the derivative with respect to $\sigma^2$ and setting it to zero:

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0 \quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$$
These are simply the sample mean and sample variance — the MLE recovers the intuitive estimators from first principles.
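A quick numerical sanity check of these closed forms (a pure-Python sketch; the true parameters $\mu = 3.0$, $\sigma = 1.5$ are assumed for illustration):

```python
import random

random.seed(0)
# Synthetic data with assumed true parameters mu=3.0, sigma^2=2.25.
data = [random.gauss(3.0, 1.5) for _ in range(50_000)]

n = len(data)
mu_hat = sum(data) / n                              # MLE: sample mean
var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE: sample variance (1/N, not 1/(N-1))

print(mu_hat, var_hat)  # should land near 3.0 and 2.25
```

Note the $\frac{1}{N}$ in the variance estimator: the MLE gives the biased sample variance, not the Bessel-corrected $\frac{1}{N-1}$ version.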
Example: Coin Flipping (Bernoulli MLE)
Suppose we flip a coin $N$ times and observe outcomes $x_1, \dots, x_N \in \{0, 1\}$, where $x_i = 1$ denotes heads and $x_i = 0$ denotes tails. We model each flip as $x_i \sim \text{Bernoulli}(p)$, with the single unknown parameter $p$. The probability of a single outcome is:

$$\Pr(x \mid p) = p^{x}(1 - p)^{1 - x}$$
The log-likelihood of a single sample is:

$$\log \Pr(x \mid p) = x \log p + (1 - x)\log(1 - p)$$
Summing over all flips and letting $k = \sum_{i=1}^{N} x_i$ denote the number of heads, the total log-likelihood is:

$$\log L(p) = k \log p + (N - k)\log(1 - p)$$
Solving for $p$: taking the derivative with respect to $p$ and setting it to zero:

$$\frac{d \log L}{d p} = \frac{k}{p} - \frac{N - k}{1 - p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{k}{N}$$
The MLE estimate is simply the empirical fraction of heads — exactly what intuition suggests.
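This can be verified numerically with a pure-Python sketch (the true bias 0.7 is an assumption for illustration): the closed form $\hat{p} = k/N$ should also win a brute-force grid search over the log-likelihood.

```python
import math
import random

random.seed(0)
true_p = 0.7  # assumed true coin bias, for illustration only
flips = [1 if random.random() < true_p else 0 for _ in range(10_000)]

N = len(flips)
k = sum(flips)   # number of heads
p_hat = k / N    # closed-form MLE

# Sanity check: maximize k*log(p) + (N-k)*log(1-p) over a grid of p values.
def log_lik(p):
    return k * math.log(p) + (N - k) * math.log(1 - p)

best = max((i / 1000 for i in range(1, 1000)), key=log_lik)
print(p_hat, best)  # both close to 0.7
```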
Example: Linear Regression (Gaussian noise MLE)
In linear regression, we observe pairs $(x_i, y_i)$, $i = 1, \dots, N$, and model the output as:

$$y_i = w^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$
This means the conditional distribution of $y$ given $x$ is:

$$p(y \mid x; w) = \mathcal{N}(y \mid w^\top x, \sigma^2)$$
The log-likelihood of a single pair is:

$$\log p(y_i \mid x_i; w) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - w^\top x_i)^2}{2\sigma^2}$$
Summing over the dataset and dropping the constant term (which does not depend on $w$), the total log-likelihood is:

$$\log L(w) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - w^\top x_i)^2 + \text{const}$$
Maximizing over $w$ is equivalent to minimizing the sum of squared residuals:

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{N}(y_i - w^\top x_i)^2$$
Solving for $w$: in matrix form, let $X \in \mathbb{R}^{N \times d}$ be the design matrix (with rows $x_i^\top$) and $y \in \mathbb{R}^{N}$ the target vector. The objective becomes $\|y - Xw\|^2$. Taking the gradient and setting it to zero:

$$\nabla_w \|y - Xw\|^2 = -2X^\top(y - Xw) = 0 \quad\Longrightarrow\quad \hat{w} = (X^\top X)^{-1}X^\top y$$
This is the well-known ordinary least squares (OLS) solution. The key insight is that minimizing MSE in linear regression is exactly MLE under a Gaussian noise assumption.
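Here is a minimal pure-Python sketch of the normal equations for the simplest case, a one-dimensional model with an intercept, $y = wx + b$ (the synthetic true slope 2.0 and intercept -1.0 are assumptions for illustration):

```python
import random

random.seed(0)
true_w, true_b = 2.0, -1.0  # assumed ground truth, for illustration
xs = [random.uniform(-3, 3) for _ in range(5_000)]
ys = [true_w * x + true_b + random.gauss(0, 0.5) for x in xs]

# Normal equations X^T X w = X^T y, where each row of X is [x_i, 1]:
#   [sum(x^2)  sum(x)] [w]   [sum(x*y)]
#   [sum(x)    n     ] [b] = [sum(y)  ]
n = len(xs)
sxx = sum(x * x for x in xs)
sx = sum(xs)
sxy = sum(x * y for x, y in zip(xs, ys))
sy = sum(ys)

# Solve the 2x2 system by Cramer's rule.
det = sxx * n - sx * sx
w_hat = (n * sxy - sx * sy) / det
b_hat = (sxx * sy - sx * sxy) / det

print(w_hat, b_hat)  # should land near 2.0 and -1.0
```

For general $d$-dimensional inputs one would solve the same system with a linear-algebra library rather than by hand; the closed form is identical.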
Connection to KL Divergence
Minimizing the NLL is equivalent to minimizing the KL divergence between the data distribution $p_{\text{data}}$ and the model $p_\theta$:

$$D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[-\log p_\theta(x)\right] - \mathbb{E}_{x \sim p_{\text{data}}}\left[-\log p_{\text{data}}(x)\right]$$
Since the second term (the entropy of $p_{\text{data}}$) does not depend on $\theta$, minimizing $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$ reduces exactly to minimizing the expected NLL.
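A small numerical illustration of this equivalence in the Bernoulli setting (the data distribution $p_{\text{data}} = \mathrm{Bernoulli}(0.7)$ is an assumption for illustration): the KL divergence and the expected NLL differ by a constant, so they share the same minimizer.

```python
import math

p_data = 0.7  # assumed data distribution, for illustration

def kl(p):
    # D_KL(Bernoulli(p_data) || Bernoulli(p))
    return p_data * math.log(p_data / p) + (1 - p_data) * math.log((1 - p_data) / (1 - p))

def expected_nll(p):
    # E_{x ~ p_data}[-log p_theta(x)]
    return -(p_data * math.log(p) + (1 - p_data) * math.log(1 - p))

grid = [i / 1000 for i in range(1, 1000)]
# The two objectives differ only by the entropy of p_data, a constant in
# theta, so their minimizers over the grid coincide.
assert min(grid, key=kl) == min(grid, key=expected_nll)
print(min(grid, key=kl))  # 0.7
```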
Summary
This blog has covered MLE and its application to simple parametric distribution families, illustrated through three classical examples: Gaussian parameter estimation, Bernoulli coin flipping, and linear regression with Gaussian noise. Note that in practice the MLE rarely admits a closed-form solution like in the examples above: $p_\theta$ is often a deep, expressive neural network, in which case the NLL objective must be minimized iteratively via gradient descent.