In many generative modeling courses, the starting point for training a neural network to generate new, realistic data is the Variational Autoencoder (VAE). This model has its origins in the AutoEncoder (AE), which serves a different purpose: reconstructing its input.
Figure 1: The standard autoencoder compresses the input $x$ into a fixed latent code $z$ via the encoder, then reconstructs $\hat{x}$ via the decoder. $z$ is a single point with no probabilistic structure.
Formally, an AE consists of two parts: an encoder that compresses the input $x$ into a compact latent representation $z = f_\phi(x)$, and a decoder that reconstructs the input from that representation, $\hat{x} = g_\theta(z)$. The network is trained end-to-end by minimizing a reconstruction loss, typically the mean squared error:

$$\mathcal{L}_{AE} = \|x - \hat{x}\|^2 = \|x - g_\theta(f_\phi(x))\|^2$$
The bottleneck forces the encoder to learn a compressed, meaningful representation of the data. Once trained, the latent space can be used for tasks like dimensionality reduction or feature extraction. However, autoencoders have a critical limitation as generative models: the latent space has no guaranteed structure. Points in latent space are not organized in any principled way, so randomly sampling an arbitrary $z$ and decoding it often yields garbage. There is no way to smoothly interpolate between examples or generate novel, realistic samples.
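To make this concrete, here is a minimal sketch of an autoencoder: a linear encoder/decoder pair trained by plain gradient descent on toy data. All dimensions, weights, and the learning rate are illustrative choices, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^8 that secretly live on a 2-D subspace,
# so a 2-D bottleneck can reconstruct them well.
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear encoder/decoder (weights only, no biases) with a 2-D bottleneck.
W_enc = rng.normal(size=(8, 2)) * 0.1
W_dec = rng.normal(size=(2, 8)) * 0.1

def recon_loss(W_enc, W_dec):
    Z = X @ W_enc       # encode: compress to latent codes
    X_hat = Z @ W_dec   # decode: reconstruct
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

lr, losses = 0.02, []
for step in range(5000):
    Z = X @ W_enc
    err = Z @ W_dec - X                       # d(loss)/d(X_hat), up to 2/N
    grad_dec = Z.T @ err * 2 / len(X)
    grad_enc = X.T @ (err @ W_dec.T) * 2 / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    losses.append(recon_loss(W_enc, W_dec))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Even this tiny model drives the reconstruction loss down, but nothing in the objective organizes the latent space for sampling: decoding a random $z$ is not encouraged to produce anything data-like.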
The Variational Autoencoder (VAE), introduced by Kingma & Welling (2013), addresses this by imposing a probabilistic structure on the latent space. Instead of mapping $x$ to a fixed point $z$, the encoder outputs the parameters of a distribution $q_\phi(z|x)$ (usually a Gaussian). A latent vector $z$ is then sampled from this distribution rather than deterministically computed. The decoder learns to reconstruct $x$ from these sampled latents. A prior $p(z)$ is placed over the latent space, and the encoder is regularized to stay close to this prior via the KL divergence. This shift - from a deterministic bottleneck to a learned posterior $q_\phi(z|x)$ - gives the latent space two important properties:
- Continuity: Nearby points in latent space decode to similar outputs. Because the encoder maps each input to a distribution over $z$ rather than a single point, similar inputs naturally produce overlapping distributions - and thus neighboring regions in latent space correspond to similar decoded outputs.
- Completeness: Any point sampled from the prior produces a meaningful output. By regularizing the encoder's posterior $q_\phi(z|x)$ to stay close to the prior $p(z)$, the model ensures that the high-probability regions of the latent space are densely covered with meaningful structure, so random samples from the prior reliably decode into coherent outputs.
Figure 2: The VAE encoder maps input $x$ to a Gaussian distribution $q_\phi(z|x)$ in latent space, regularized toward the prior $p(z)$. A vector $z$ is then sampled from $q_\phi(z|x)$ and decoded by $p_\theta(x|z)$ to reconstruct $\hat{x}$.
1. Construction of the VAE
Suppose we have a dataset of $N$ samples drawn i.i.d. from an unknown, complex distribution $p(x)$. Since the true form of $p(x)$ is unknown, we cannot generate new samples by drawing from it directly. The goal of a generative model is to learn a tractable approximation $p_\theta(x)$ from this finite dataset by minimizing a divergence between the two distributions.
Figure 3: The goal of a generative model: find parameters $\theta$ that minimize the divergence between the true data distribution $p(x)$ (blue) and the learned model distribution $p_\theta(x)$ (yellow). As the divergence shrinks, the model distribution increasingly overlaps with the data distribution.
Once the optimal parameters $\theta^*$ are found, $p_{\theta^*}(x)$ can serve as a proxy for $p(x)$, enabling two key capabilities:
- Generation: Draw new, realistic samples from $p_\theta(x)$ via sampling methods such as Monte Carlo sampling.
- Evaluation: Assess how likely a given sample is under the learned distribution - for instance, judging whether an image looks realistic by computing the likelihood $p_\theta(x)$.
In the case of VAEs, the divergence is the KL divergence $D_{KL}(p(x) \,\|\, p_\theta(x))$:

$$D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{p_\theta(x)}\right]$$
KL divergence intuition
As we can see, the KL divergence measures the expected log-likelihood difference between $p(x)$ and $p_\theta(x)$. Therefore, minimizing it pushes $p_\theta$ to assign high likelihood to real data sampled from $p(x)$.
Now, we can rewrite the KL divergence as follows:

$$D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) = \underbrace{\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right]}_{\text{constant w.r.t. } \theta} - \mathbb{E}_{x \sim p(x)}\left[\log p_\theta(x)\right]$$
The constant term is simply the negative entropy of $p(x)$ and is independent of $\theta$. This is very convenient, as $p(x)$ is unknown; minimizing the KL divergence is therefore equivalent to maximizing the expected log-likelihood of the data under $p_\theta$:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim p(x)}\left[\log p_\theta(x)\right]$$
This is precisely the maximum likelihood estimation (MLE) objective. In practice we replace this population expectation with its Monte Carlo estimate, yielding the empirical MLE objective:

$$\theta^* = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
where $N$ is the number of samples in the dataset. This objective is then optimized via SGD over minibatches.
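As a sanity check of this objective, the sketch below uses a hypothetical 1-D Gaussian model (not from the text): it Monte Carlo-estimates the expected log-likelihood and confirms that parameters closer to the true data distribution score higher.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data drawn from an unknown p(x); here secretly N(3, 1).
data = rng.normal(loc=3.0, scale=1.0, size=10_000)

def avg_log_likelihood(x, mu, sigma):
    """Monte Carlo estimate of E_{x~p}[log p_theta(x)] for a Gaussian model."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

# The empirical MLE objective prefers parameters closer to the true distribution.
good = avg_log_likelihood(data, mu=3.0, sigma=1.0)
bad = avg_log_likelihood(data, mu=0.0, sigma=1.0)
print(good, bad)  # good > bad
```

For a Gaussian model the empirical MLE solution is just the sample mean and standard deviation; for a deep $p_\theta$ the same objective is optimized by SGD instead.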
1.1 Decoder (Generator)
Returning to the autoencoder setting, the goal is to generate a new sample $x$ from a latent variable $z$ via a neural network decoder $p_\theta(x|z)$. We can express the target distribution in equation (1) as the marginal distribution:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$
Unfortunately, directly optimizing this objective via MLE is intractable: it requires integrating over the entire high-dimensional latent space, and since $p_\theta(x|z)$ is a deep, expressive neural network with no closed-form solution, evaluating this integral exactly is computationally infeasible. To make the optimization tractable, we need a way to focus only on latent states $z$ that are likely to have generated the current input $x$, rather than integrating over the entire latent space.
1.2 Encoder (Inference Model)
We can reframe the problem: instead of integrating over all possible $z$, can we identify which latent states are most likely to have produced the observed sample $x$? This leads us to consider the posterior distribution $p_\theta(z|x)$, which by Bayes' rule is:

$$p_\theta(z|x) = \frac{p_\theta(x|z)\, p(z)}{p_\theta(x)}$$
However, computing this posterior directly is equally intractable, as the denominator $p_\theta(x)$ is the same marginal likelihood we started with. This motivates approximating the true posterior with a learned inference model:

$$q_\phi(z|x) \approx p_\theta(z|x)$$
And yes, this is exactly the encoder of the VAE, which can be trained to concentrate probability mass on the latent states most relevant to $x$.
2. ELBO (Evidence Lower Bound)
Now that we have a controllable encoder model $q_\phi(z|x)$ to generate $z$, we can rewrite the MLE objective using it:

$$\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, dz = \log \mathbb{E}_{z \sim q_\phi(z|x)}\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right]$$

Applying Jensen's inequality (the logarithm is concave), we obtain the evidence lower bound (ELBO):

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \;=:\; \mathcal{L}_{ELBO}(x; \theta, \phi)$$
Expanding further, we can see that $\mathcal{L}_{ELBO}$ consists of two terms:

$$\mathcal{L}_{ELBO}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
- Reconstruction term $\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$: This is the reconstruction objective from the standard AE, but now evaluated only over $z$ sampled from the encoder, making it tractable.
- Regularization term $-D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$: This penalizes the encoder's posterior for deviating from the prior $p(z)$, enforcing the latent space structure needed for generation.
Why is the learning objective tractable now?
The original MLE objective is intractable because it requires integrating $p_\theta(x|z)$ - a neural network - over the entire latent space. The ELBO resolves this in two key ways:
1. Replacing the integral with a tractable expectation: Instead of integrating over all $z$, the reconstruction term only requires sampling from the encoder $q_\phi(z|x)$, which concentrates mass on the latent regions most relevant to $x$.
2. A closed-form KL term: $q_\phi(z|x)$ is usually modeled as a simple distribution, typically a Gaussian, and the KL divergence between two Gaussians has a closed-form solution - no integration is needed at all. It is also easily trainable via the reparameterization trick.
Together, the two terms create a natural tension: the reconstruction term encourages the decoder to recover the original input as accurately as possible from latent samples, while the regularization term pulls the encoder's posterior back toward the prior $p(z)$. The VAE learns by striking a balance between these two competing objectives.
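The bound can be checked numerically. The sketch below uses a hypothetical linear-Gaussian toy model (all constants are illustrative) where $\log p_\theta(x)$ has a closed form, and shows that the Monte Carlo ELBO stays below it, with the gap closing as $q$ approaches the true posterior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model where log p(x) is available in closed form.
# Prior: z ~ N(0, 1).  Decoder: x | z ~ N(a*z, s2_dec).
a, s2_dec = 2.0, 0.5
x = 1.3  # an observed data point

# Exact evidence: p(x) = N(x; 0, a^2 + s2_dec)
var_x = a**2 + s2_dec
log_px = -0.5 * np.log(2 * np.pi * var_x) - x**2 / (2 * var_x)

def elbo(mu_q, s2_q, n_samples=100_000):
    """Monte Carlo ELBO for a Gaussian posterior q(z|x) = N(mu_q, s2_q)."""
    z = mu_q + np.sqrt(s2_q) * rng.normal(size=n_samples)
    log_p_x_given_z = (-0.5 * np.log(2 * np.pi * s2_dec)
                       - (x - a * z) ** 2 / (2 * s2_dec))
    kl = 0.5 * (mu_q**2 + s2_q - np.log(s2_q) - 1)  # KL(q || N(0,1))
    return log_p_x_given_z.mean() - kl

# A deliberately bad posterior gives a loose bound; the true posterior
# (computable in closed form for this toy model) makes the bound tight.
loose = elbo(mu_q=0.0, s2_q=1.0)
mu_star = a * x / var_x    # true posterior mean
s2_star = s2_dec / var_x   # true posterior variance
tight = elbo(mu_star, s2_star)
print(loose, tight, log_px)  # loose < tight <= log_px
```

In a real VAE neither $\log p_\theta(x)$ nor the true posterior is available; the encoder has to learn the tightening that we computed analytically here.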
ELBO as a Divergence Bound
So what is the relationship between the ELBO and the true MLE goal $\log p_\theta(x)$? Recall that maximum likelihood training amounts to minimizing the KL divergence between $p(x)$ and the learned distribution $p_\theta(x)$:

$$\theta^* = \arg\min_\theta \; D_{KL}\left(p(x) \,\|\, p_\theta(x)\right)$$
Since this term is intractable in general, the variational framework of the VAE instead compares two joint distributions over $(x, z)$:
- Generative joint (decoder): $p_\theta(x, z) = p_\theta(x|z)\, p(z)$
- Inference joint (encoder): $q_\phi(x, z) = p(x)\, q_\phi(z|x)$
The total error in matching these two joints decomposes as:

$$D_{KL}\left(q_\phi(x, z) \,\|\, p_\theta(x, z)\right) = D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) + \mathbb{E}_{x \sim p(x)}\left[D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)\right]$$

Thus we have:

$$D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) \;\leq\; D_{KL}\left(q_\phi(x, z) \,\|\, p_\theta(x, z)\right)$$
where equality holds when the inference error is zero, which means the encoder perfectly models the unknown posterior distribution $p_\theta(z|x)$.
Note that $\log p_\theta(x)$ can also be rewritten as:

$$\log p_\theta(x) = \mathcal{L}_{ELBO}(x; \theta, \phi) + D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)$$
We can see that the gap between the true log-likelihood and the ELBO is precisely the inference error for the current sample $x$. Maximizing the ELBO therefore directly reduces this gap. Specifically, optimizing the encoder tightens the bound by bringing the approximate posterior $q_\phi(z|x)$ closer to the true one $p_\theta(z|x)$, while optimizing the decoder pushes $\log p_\theta(x)$ itself upward - lifting the entire lower bound and improving the overall log-likelihood.
3. Gaussian VAEs
The most common instantiation of the VAE framework is the Gaussian VAE, where the encoder, decoder and prior are modeled as Gaussians.
Figure 4: Overview of the Gaussian VAE. Each input $x$ is encoded into a class-conditional Gaussian $q_\phi(z|x)$ (colored clusters). The aggregate posterior is matched to the isotropic prior $p(z) = \mathcal{N}(0, I)$ via the KL term in the ELBO. Samples from $q_\phi(z|x)$ are decoded by $p_\theta(x|z)$ to produce reconstructions $\hat{x}$, whose marginal approximates the data distribution.
3.1 The encoder part
For each input $x$, the encoder produces a Gaussian distribution centered at $\mu_\phi(x)$ with variance $\sigma^2_\phi(x)$, so that similar inputs yield overlapping distributions in the latent space:

$$q_\phi(z|x) = \mathcal{N}\left(z;\, \mu_\phi(x),\, \sigma^2_\phi(x)\, I\right), \qquad z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
This is the reparameterization trick: by expressing $z$ as a deterministic function of $(\mu_\phi(x), \sigma_\phi(x))$ and a fixed noise variable $\epsilon$, the stochasticity is separated from the parameters, making the sampling step differentiable and allowing gradients to flow back through $z$ to the encoder.
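A minimal sketch of the trick, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -2.0])     # encoder mean for some input x
sigma = np.array([0.5, 1.5])   # encoder std for the same input

# Reparameterization: the stochasticity lives in eps, not in (mu, sigma).
eps = rng.normal(size=(100_000, 2))  # eps ~ N(0, I), independent of parameters
z = mu + sigma * eps                 # deterministic in (mu, sigma) given eps

print(z.mean(axis=0))  # ~ mu
print(z.std(axis=0))   # ~ sigma
```

Because $z$ is a deterministic function of $(\mu, \sigma)$ for a fixed $\epsilon$, we have $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$, so gradients of the reconstruction loss can flow through the sampling step to the encoder parameters.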
Since the prior $p(z) = \mathcal{N}(0, I)$ is also Gaussian, the KL divergence between the two admits a closed-form solution - no numerical integration required:

$$D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

where $d$ is the dimensionality of the latent space.
Derivation of closed-form KL loss
Since both $q_\phi(z|x)$ and $p(z)$ are diagonal, the KL factorizes over dimensions. It suffices to derive the KL for a single scalar dimension, $\mathcal{N}(\mu, \sigma^2)$ vs. $\mathcal{N}(0, 1)$:

$$D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\right) = \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}\left[\log \frac{\mathcal{N}(z; \mu, \sigma^2)}{\mathcal{N}(z; 0, 1)}\right] = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$

Summing over all independent dimensions:

$$D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
Taking the gradient of the KL term with respect to $\mu_j$ and $\sigma_j$, we have:

$$\frac{\partial D_{KL}}{\partial \mu_j} = \mu_j, \qquad \frac{\partial D_{KL}}{\partial \sigma_j} = \sigma_j - \frac{1}{\sigma_j}$$

Setting these to zero gives $\mu_j = 0$ and $\sigma_j = 1$. Therefore minimizing the KL term alone pushes the encoder toward $q_\phi(z|x) = \mathcal{N}(0, I)$ for every input - a latent code that ignores $x$ entirely.
This is why the reconstruction term is essential: it pulls $\mu_\phi(x)$ away from zero and $\sigma_\phi(x)$ toward smaller values, to make $z$ informative about $x$.
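The closed-form expression can be verified against a direct Monte Carlo estimate of the KL for one latent dimension (the values of $\mu$ and $\sigma$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma = 0.7, 0.4  # one latent dimension of q(z|x) = N(mu, sigma^2)

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) from the derivation above.
kl_closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1)

# Monte Carlo estimate: E_{z~q}[log q(z) - log p(z)].
z = rng.normal(mu, sigma, size=1_000_000)
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)
log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates agree
```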
3.2 The Decoder part
To counteract collapse from the regularization term, the reconstruction term enforces that $z$ remains informative about $x$. Specifically, the decoder is trained to output a sample $\hat{x}$ that resembles the original input as closely as possible, given a latent vector drawn from the encoder's posterior $q_\phi(z|x)$. Note that $\hat{x}$ need not be identical to $x$:

$$p_\theta(x|z) = \mathcal{N}\left(x;\, \mu_\theta(z),\, \sigma^2 I\right)$$
Here $\mu_\theta(z)$ is the output of a neural network decoder, and $\sigma^2$ is a fixed hyperparameter controlling the spread of the output distribution - a large $\sigma^2$ allows more deviation from the input, while a small $\sigma^2$ forces the reconstruction to stay close to the input $x$. The reconstruction loss can now be rewritten as:

$$\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] = -\frac{1}{2\sigma^2}\, \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right] + \text{const}$$
This is equivalent to minimizing the expected MSE between the input $x$ and the decoder output $\mu_\theta(z)$ - essentially the original AE loss, now averaged over latent samples.
3.3 Overall Training Procedure
With both the encoder and decoder defined, the full training procedure follows directly from maximizing the ELBO. Each training step processes a minibatch of inputs: encode each $x$ into $(\mu_\phi(x), \sigma_\phi(x))$, sample $z$ via the reparameterization trick, decode to $\mu_\theta(z)$, and minimize the sum of the MSE reconstruction term and the closed-form KL term by gradient descent.
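A minimal sketch of the per-minibatch loss computation, with single linear layers standing in for the real encoder/decoder networks (all names, sizes, and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

d_in, d_z = 6, 2

# Hypothetical toy weights standing in for encoder/decoder networks.
W_mu = rng.normal(size=(d_in, d_z)) * 0.1
W_logvar = rng.normal(size=(d_in, d_z)) * 0.1
W_dec = rng.normal(size=(d_z, d_in)) * 0.1

def vae_loss(x_batch):
    """One forward pass of the (negative) ELBO for a minibatch."""
    # 1. Encode: predict the parameters of q(z|x).
    mu = x_batch @ W_mu
    logvar = x_batch @ W_logvar
    # 2. Sample z with the reparameterization trick.
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # 3. Decode and compute the reconstruction (MSE) term.
    x_hat = z @ W_dec
    recon = np.mean(np.sum((x_batch - x_hat) ** 2, axis=1))
    # 4. Closed-form KL term against the N(0, I) prior.
    kl = np.mean(0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1, axis=1))
    return recon + kl  # minimize the negative ELBO (up to constants)

batch = rng.normal(size=(32, d_in))
loss = vae_loss(batch)
print(loss)
```

In a real implementation the three linear maps would be deep networks and the gradient of this scalar loss would be backpropagated through both the decoder and, via the reparameterized $z$, the encoder.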
4. Drawbacks of Gaussian VAEs
Despite its elegance, the Gaussian VAE has several well-known limitations:
4.1 Blurry reconstructions
Modeling $p_\theta(x|z)$ as a Gaussian with fixed variance corresponds to minimizing MSE, which tends to average over the multiple plausible reconstructions consistent with a sampled code $z$.
Proof of blurry reconstructions in VAEs
Recall the per-sample reconstruction loss:

$$\mathcal{L}_{rec}(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right]$$

When training over the full dataset, we optimize its expectation over all $x \sim p(x)$:

$$\mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right] = \mathbb{E}_{z \sim q_\phi(z)}\, \mathbb{E}_{x \sim q_\phi(x|z)}\left[\|x - \mu_\theta(z)\|^2\right]$$

Since $\mu_\theta(z)$ only appears in the inner expectation and has no effect on the aggregate posterior $q_\phi(z)$, the outer expectation acts as a constant weight. It suffices to minimize the inner term with respect to $\mu_\theta(z)$ for each fixed $z$. Taking the gradient and setting it to zero:

$$\mu_\theta^*(z) = \mathbb{E}_{x \sim q_\phi(x|z)}\left[x\right]$$
The optimal decoder output is the conditional mean of $x$ given $z$ under the encoder's inverse distribution $q_\phi(x|z)$. When multiple distinct images map to similar latent codes $z$, the MSE loss forces the decoder to output their average - producing blurry reconstructions.
For image data, this produces blurry outputs rather than sharp, realistic samples.
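The averaging effect is easy to see numerically. Assuming two hypothetical inputs that happen to share a latent code, the MSE-optimal decoder output is their mean, not either sharp input:

```python
import numpy as np

# Two distinct "images" (flattened) that the encoder maps to the same z.
x1 = np.array([1.0, 0.0, 1.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])

def expected_mse(x_hat):
    """Reconstruction loss when z is equally likely to have come from x1 or x2."""
    return 0.5 * np.sum((x1 - x_hat) ** 2) + 0.5 * np.sum((x2 - x_hat) ** 2)

# The MSE-optimal output is the conditional mean -- the blurry average --
# which beats committing to either sharp image.
blurry = (x1 + x2) / 2
print(expected_mse(blurry), expected_mse(x1), expected_mse(x2))  # 1.0 2.0 2.0
```

The average `[0.5, 0.5, 0.5, 0.5]` looks like neither original: this is exactly the washed-out, low-contrast character of VAE samples on image data.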
So what can be done to improve this? One natural idea is to make the encoder or decoder more expressive by adding more layers to capture richer, more complex latent structure. However, increasing depth alone does not solve the limited posterior expressiveness problem of the encoder. Moreover, when the decoder grows too powerful, it can reconstruct $x$ without relying on $z$ at all, triggering posterior collapse.
4.2 Limited posterior expressiveness
The diagonal Gaussian assumption for $q_\phi(z|x)$ restricts the variational family to axis-aligned ellipsoids (i.e., zero off-diagonal covariance). If the true posterior has complex, multimodal, or highly correlated structure, a single diagonal Gaussian cannot capture it - leading to a persistently loose ELBO bound regardless of encoder capacity.
4.3 Posterior Collapse
Posterior collapse occurs when the decoder learns to ignore the latent code entirely. Why does this happen? One key reason is that when the decoder becomes too expressive (a sufficiently deep neural network), at some point during training it finds it easier to simply approximate the data distribution directly:

$$p_\theta(x|z) \approx p_\theta(x) \quad \text{for all } z$$
This sounds like a win for reconstruction quality, but it also breaks the encoder. The decoder has learned to ignore $z$: the VAE's output now barely changes no matter what latent code it receives. Recall the ELBO:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
When the decoder is powerful enough to reconstruct $x$ without using $z$, by exploiting its own internal structure, the reconstruction term becomes approximately constant with respect to $q_\phi$. The gradient signal that would normally force the encoder to encode information into $z$ disappears. The optimizer then finds the path of least resistance: collapse the KL term to zero by driving $q_\phi(z|x) \to p(z)$. At this point, $x$ and $z$ become statistically independent, the latent code carries no information about the input, and the decoder can no longer be controlled through it. The VAE reduces to a decoder-only model, and the ELBO degenerates to:

$$\mathcal{L}_{ELBO} \approx \mathbb{E}_{z \sim p(z)}\left[\log p_\theta(x)\right] - 0 = \log p_\theta(x)$$
Proof of the independence of $x$ and $z$ after posterior collapse
We can rewrite the regularization term, averaged over all $x \sim p(x)$, as:

$$\mathbb{E}_{x \sim p(x)}\left[D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)\right] = I_{q_\phi}(x; z) + D_{KL}\left(q_\phi(z) \,\|\, p(z)\right)$$

where $q_\phi(z) = \mathbb{E}_{x \sim p(x)}\left[q_\phi(z|x)\right]$ is the aggregate posterior and $I_{q_\phi}(x; z)$ is the mutual information between $x$ and $z$ under the joint $p(x)\, q_\phi(z|x)$. This decomposition reveals what posterior collapse actually destroys. When the regularization term is driven to zero, both components - each non-negative - must vanish simultaneously:
- $I_{q_\phi}(x; z) = 0$: The latent code $z$ becomes statistically independent of $x$, i.e. $q_\phi(z|x) = q_\phi(z)$.
- $D_{KL}\left(q_\phi(z) \,\|\, p(z)\right) = 0$: The aggregate posterior collapses to the isotropic Gaussian $\mathcal{N}(0, I)$. The latent space loses all class-specific structure; the per-class mixture components that distinguish different types of $x$ are wiped out.
4.4 Mismatch between aggregate posterior and prior
Another weakness of standard VAEs is that even if each individual posterior $q_\phi(z|x)$ is close to the prior, the aggregated posterior $q_\phi(z)$ may still mismatch $p(z)$ in some regions. This mismatch creates "holes" in the latent space - regions with high prior probability but low aggregate posterior density - causing poor sample quality at generation time.
5. Conclusion
The Variational Autoencoder is a foundational generative model that elegantly combines probabilistic inference with deep learning. By replacing the deterministic bottleneck of a standard autoencoder with a learned posterior distribution, VAEs endow the latent space with a structured, continuous geometry that supports both generation and interpolation. The ELBO provides a tractable training objective that simultaneously encourages faithful reconstruction and regularizes the latent space toward a simple prior - a tension that lies at the heart of all latent-variable generative models.
That said, the Gaussian VAE is far from perfect. The four drawbacks discussed above - blurry reconstructions from the MSE objective, posterior collapse, limited posterior expressiveness from the diagonal Gaussian assumption, and aggregate posterior mismatch - are not merely implementation details; they are fundamental limitations that arise from the design choices made to keep the ELBO tractable.
What comes next? Two important lines of work build directly on these observations to solve standard VAE’s limitations:
- Hierarchical VAEs (HVAEs) address the expressiveness problem by stacking multiple layers of stochastic latent variables. Rather than compressing $x$ into a single $z$, HVAEs learn a hierarchy $z_1, \dots, z_L$ where each layer captures structure at a different level of abstraction. This allows the model to represent far richer posteriors, and the ELBO generalizes naturally to the hierarchical setting.
- Denoising Diffusion Probabilistic Models (DDPMs) take a different philosophical path. Instead of learning a compact latent code, diffusion models define a fixed forward process that gradually corrupts data with Gaussian noise over $T$ steps, then learn to reverse this process step by step. Remarkably, this can be seen as a special case of a hierarchical latent-variable model where the encoder is fixed (the forward noising process) and only the decoder (the denoising network) is learned. This design sidesteps the blurry-reconstruction and posterior-collapse problems entirely - the fixed encoder cannot collapse, and the step-by-step denoising objective enforces sharp, high-frequency detail at each scale.
References
[1] Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes.
[2] The Principles of Diffusion Models. https://the-principles-of-diffusion-models.github.io/
[3] Wikipedia contributors. Jensen's inequality. https://en.wikipedia.org/wiki/Jensen%27s_inequality
[4] Wikipedia contributors. Monte Carlo method. https://en.wikipedia.org/wiki/Monte_Carlo_method
[5] Wikipedia contributors. Reparameterization trick. https://en.wikipedia.org/wiki/Reparameterization_trick
Notations
See the notation reference for a summary of symbols used across all notes.
AI Acknowledgement
And yes, I had AI help with wording and structure (Claude by Anthropic) (•‿•)