In many generative modeling courses, the starting point for training a neural network to generate new, realistic data is the Variational Autoencoder (VAE). This model has its origins in the AutoEncoder (AE), which serves a different purpose: reconstructing its input.
Figure 1: The standard autoencoder compresses the input $x$ into a fixed latent code $z$ via the encoder, then reconstructs $\hat{x}$ via the decoder. $z$ is a single point with no probabilistic structure.
Formally, an AE consists of two parts: an encoder that compresses the input $x$ into a compact latent representation $z = f_\phi(x)$, and a decoder that reconstructs the input from that representation, $\hat{x} = g_\theta(z)$. The network is trained end-to-end by minimizing a reconstruction loss, typically the mean squared error:

$$\mathcal{L}_{AE} = \|x - \hat{x}\|^2 = \|x - g_\theta(f_\phi(x))\|^2$$
The bottleneck forces the encoder to learn a compressed, meaningful representation of the data. Once trained, the latent space can be used for tasks like dimensionality reduction or feature extraction. However, autoencoders have a critical limitation as generative models: the latent space has no guaranteed structure. Points in latent space are not organized in any principled way, so randomly sampling an arbitrary $z$ and decoding it often yields garbage. There is no way to smoothly interpolate between examples or generate novel, realistic samples.
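To make this concrete, here is a minimal sketch of an autoencoder: a linear encoder/decoder pair trained by plain gradient descent on toy data. All dimensions, weights, and the learning rate are illustrative choices, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^8 that secretly live on a 2-D subspace,
# so a 2-D bottleneck can reconstruct them well.
basis = rng.normal(size=(2, 8))
X = rng.normal(size=(200, 2)) @ basis

# Linear encoder/decoder (weights only, no biases) with a 2-D bottleneck.
W_enc = rng.normal(size=(8, 2)) * 0.1
W_dec = rng.normal(size=(2, 8)) * 0.1

def recon_loss(W_enc, W_dec):
    Z = X @ W_enc       # encode: compress to latent codes
    X_hat = Z @ W_dec   # decode: reconstruct
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

lr, losses = 0.02, []
for step in range(5000):
    Z = X @ W_enc
    err = Z @ W_dec - X                       # d(loss)/d(X_hat), up to 2/N
    grad_dec = Z.T @ err * 2 / len(X)
    grad_enc = X.T @ (err @ W_dec.T) * 2 / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    losses.append(recon_loss(W_enc, W_dec))

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Even this tiny model drives the reconstruction loss down, but nothing in the objective organizes the latent space for sampling: decoding a random $z$ is not encouraged to produce anything data-like.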
The Variational Autoencoder (VAE), introduced by Kingma & Welling (2013), addresses this by imposing a probabilistic structure on the latent space. Instead of mapping $x$ to a fixed point $z$, the encoder outputs the parameters of a distribution $q_\phi(z|x)$ (usually a Gaussian). A latent vector $z$ is then sampled from this distribution rather than deterministically computed. The decoder learns to reconstruct $x$ from these sampled latents. A prior $p(z)$ is placed over the latent space, and the encoder is regularized to stay close to this prior via the KL divergence. This shift - from a deterministic bottleneck to a learned posterior $q_\phi(z|x)$ - gives the latent space two important properties:
- Continuity: Nearby points in latent space decode to similar outputs. Because the encoder maps each input to a distribution over $z$ rather than a single point, similar inputs naturally produce overlapping distributions - and thus neighboring regions in latent space correspond to similar decoded outputs.
- Completeness: Any point sampled from the prior produces a meaningful output. By regularizing the encoder's posterior $q_\phi(z|x)$ to stay close to the prior $p(z)$, the model ensures that the high-probability regions of the latent space are densely covered with meaningful structure, so random samples from the prior reliably decode into coherent outputs.
Figure 2: The VAE encoder maps input $x$ to a Gaussian distribution $q_\phi(z|x)$ in latent space, regularized toward the prior $p(z)$. A vector $z$ is then sampled from $q_\phi(z|x)$ and decoded by $p_\theta(x|z)$ to reconstruct $\hat{x}$.
1. Construction of the VAE
Suppose we have a dataset of $N$ samples drawn i.i.d. from an unknown, complex distribution $p(x)$. Since the true form of $p(x)$ is unknown, we cannot generate new samples by drawing from it directly. The goal of a generative model is to learn a tractable approximation $p_\theta(x)$ from this finite dataset by minimizing a divergence between the two distributions.
Figure 3: The goal of a generative model: find parameters $\theta$ that minimize the divergence between the true data distribution $p(x)$ (blue) and the learned model distribution $p_\theta(x)$ (yellow). As the divergence shrinks, the model distribution increasingly overlaps with the data distribution.
Once the optimal parameters $\theta^*$ are found, $p_{\theta^*}(x)$ can serve as a proxy for $p(x)$, enabling two key capabilities:
- Generation: Draw new, realistic samples from $p_\theta(x)$ via sampling methods such as Monte Carlo sampling.
- Evaluation: Assess how likely a given sample is under the learned distribution - for instance, judging whether an image looks realistic by computing the likelihood $p_\theta(x)$.
In the case of VAEs, the divergence is the KL divergence $D_{KL}(p(x) \,\|\, p_\theta(x))$:

$$D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{p_\theta(x)}\right]$$
KL divergence intuition
As we can see, the KL divergence measures the expected log-likelihood difference between $p(x)$ and $p_\theta(x)$. Therefore, minimizing it pushes $p_\theta$ to assign high likelihood to real data sampled from $p(x)$.
Now, we can rewrite the KL divergence as follows:

$$D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) = \underbrace{\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right]}_{\text{constant w.r.t. } \theta} - \mathbb{E}_{x \sim p(x)}\left[\log p_\theta(x)\right]$$
The constant term is simply the negative entropy of $p(x)$ and is independent of $\theta$. This is very convenient, as $p(x)$ is unknown; minimizing the KL divergence is therefore equivalent to maximizing the expected log-likelihood of the data under $p_\theta$:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim p(x)}\left[\log p_\theta(x)\right]$$
This is precisely the maximum likelihood estimation (MLE) objective. In practice we replace this population expectation with its Monte Carlo estimate, yielding the empirical MLE objective:

$$\theta^* = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
where $N$ is the number of samples in the dataset. This objective is then optimized via SGD over minibatches.
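As a sanity check of this objective, the sketch below uses a hypothetical 1-D Gaussian model (not from the text): it Monte Carlo-estimates the expected log-likelihood and confirms that parameters closer to the true data distribution score higher.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data drawn from an unknown p(x); here secretly N(3, 1).
data = rng.normal(loc=3.0, scale=1.0, size=10_000)

def avg_log_likelihood(x, mu, sigma):
    """Monte Carlo estimate of E_{x~p}[log p_theta(x)] for a Gaussian model."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

# The empirical MLE objective prefers parameters closer to the true distribution.
good = avg_log_likelihood(data, mu=3.0, sigma=1.0)
bad = avg_log_likelihood(data, mu=0.0, sigma=1.0)
print(good, bad)  # good > bad
```

For a Gaussian model the empirical MLE solution is just the sample mean and standard deviation; for a deep $p_\theta$ the same objective is optimized by SGD instead.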
1.1 Decoder (Generator)
Returning to the autoencoder setting, the goal is to generate a new sample $x$ from a latent variable $z$ via a neural network decoder $p_\theta(x|z)$. We can express the target distribution in equation (1) as the marginal distribution:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$
Unfortunately, directly optimizing this objective via MLE is intractable: it requires integrating over the entire high-dimensional latent space, and since $p_\theta(x|z)$ is a deep, expressive neural network with no closed-form solution, evaluating this integral exactly is computationally infeasible. To make the optimization tractable, we need a way to focus only on latent states $z$ that are likely to have generated the current input $x$, rather than integrating over the entire latent space.
1.2 Encoder (Inference Model)
We can reframe the problem: instead of integrating over all possible $z$, can we identify which latent states are most likely to have produced the observed sample $x$? This leads us to consider the posterior distribution $p_\theta(z|x)$, which by Bayes' rule is:

$$p_\theta(z|x) = \frac{p_\theta(x|z)\, p(z)}{p_\theta(x)}$$
However, computing this posterior directly is equally intractable, as the denominator $p_\theta(x)$ is the same marginal likelihood we started with. This motivates approximating the true posterior with a learned inference model:

$$q_\phi(z|x) \approx p_\theta(z|x)$$
And yes, this is exactly the encoder of the VAE, which can be trained to concentrate probability mass on the latent states most relevant to $x$.
2. ELBO (Evidence Lower Bound)
Now that we have a controllable encoder model $q_\phi(z|x)$ to generate $z$, we can rewrite the MLE objective using it:

$$\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, dz = \log \mathbb{E}_{z \sim q_\phi(z|x)}\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right]$$

Applying Jensen's inequality (the logarithm is concave), we obtain the evidence lower bound (ELBO):

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \;=:\; \mathcal{L}_{ELBO}(x; \theta, \phi)$$
Expanding further, we can see that $\mathcal{L}_{ELBO}$ consists of two terms:

$$\mathcal{L}_{ELBO}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
- Reconstruction term $\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$: This is the reconstruction objective from the standard AE, but now evaluated only over $z$ sampled from the encoder, making it tractable.
- Regularization term $-D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$: This penalizes the encoder's posterior for deviating from the prior $p(z)$, enforcing the latent space structure needed for generation.
Why is the learning objective tractable now?
The original MLE objective is intractable because it requires integrating $p_\theta(x|z)$ - a neural network - over the entire latent space. The ELBO resolves this in two key ways:
1. Replacing the integral with a tractable expectation: Instead of integrating over all $z$, the reconstruction term only requires sampling from the encoder $q_\phi(z|x)$, which concentrates mass on the latent regions most relevant to $x$.
2. A closed-form KL term: $q_\phi(z|x)$ is usually modeled as a simple distribution, typically a Gaussian, and the KL divergence between two Gaussians has a closed-form solution - no integration is needed at all. It is also easily trainable via the reparameterization trick.
Together, the two terms create a natural tension: the reconstruction term encourages the decoder to recover the original input as accurately as possible from latent samples, while the regularization term pulls the encoder's posterior back toward the prior $p(z)$. The VAE learns by striking a balance between these two competing objectives.
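The bound can be checked numerically. The sketch below uses a hypothetical linear-Gaussian toy model (all constants are illustrative) where $\log p_\theta(x)$ has a closed form, and shows that the Monte Carlo ELBO stays below it, with the gap closing as $q$ approaches the true posterior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model where log p(x) is available in closed form.
# Prior: z ~ N(0, 1).  Decoder: x | z ~ N(a*z, s2_dec).
a, s2_dec = 2.0, 0.5
x = 1.3  # an observed data point

# Exact evidence: p(x) = N(x; 0, a^2 + s2_dec)
var_x = a**2 + s2_dec
log_px = -0.5 * np.log(2 * np.pi * var_x) - x**2 / (2 * var_x)

def elbo(mu_q, s2_q, n_samples=100_000):
    """Monte Carlo ELBO for a Gaussian posterior q(z|x) = N(mu_q, s2_q)."""
    z = mu_q + np.sqrt(s2_q) * rng.normal(size=n_samples)
    log_p_x_given_z = (-0.5 * np.log(2 * np.pi * s2_dec)
                       - (x - a * z) ** 2 / (2 * s2_dec))
    kl = 0.5 * (mu_q**2 + s2_q - np.log(s2_q) - 1)  # KL(q || N(0,1))
    return log_p_x_given_z.mean() - kl

# A deliberately bad posterior gives a loose bound; the true posterior
# (computable in closed form for this toy model) makes the bound tight.
loose = elbo(mu_q=0.0, s2_q=1.0)
mu_star = a * x / var_x    # true posterior mean
s2_star = s2_dec / var_x   # true posterior variance
tight = elbo(mu_star, s2_star)
print(loose, tight, log_px)  # loose < tight <= log_px
```

In a real VAE neither $\log p_\theta(x)$ nor the true posterior is available; the encoder has to learn the tightening that we computed analytically here.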
ELBO as a Divergence Bound
So what is the relationship between the ELBO and the true MLE goal $\log p_\theta(x)$? Recall that maximum likelihood training amounts to minimizing the KL divergence between $p(x)$ and the learned distribution $p_\theta(x)$:

$$\theta^* = \arg\min_\theta \; D_{KL}\left(p(x) \,\|\, p_\theta(x)\right)$$
Since this term is intractable in general, the variational framework of the VAE instead compares two joint distributions over $(x, z)$:
- Generative joint (decoder): $p_\theta(x, z) = p_\theta(x|z)\, p(z)$
- Inference joint (encoder): $q_\phi(x, z) = p(x)\, q_\phi(z|x)$
The total error in matching these two joints decomposes as:

$$D_{KL}\left(q_\phi(x, z) \,\|\, p_\theta(x, z)\right) = D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) + \mathbb{E}_{x \sim p(x)}\left[D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)\right]$$

Thus we have:

$$D_{KL}\left(p(x) \,\|\, p_\theta(x)\right) \;\leq\; D_{KL}\left(q_\phi(x, z) \,\|\, p_\theta(x, z)\right)$$
where equality holds when the inference error is zero, which means the encoder perfectly models the unknown posterior distribution $p_\theta(z|x)$.
Note that $\log p_\theta(x)$ can also be rewritten as:

$$\log p_\theta(x) = \mathcal{L}_{ELBO}(x; \theta, \phi) + D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)$$
We can see that the gap between the true log-likelihood and the ELBO is precisely the inference error for the current sample $x$. Maximizing the ELBO therefore directly reduces this gap. Specifically, optimizing the encoder tightens the bound by bringing the approximate posterior $q_\phi(z|x)$ closer to the true one $p_\theta(z|x)$, while optimizing the decoder pushes $\log p_\theta(x)$ itself upward - lifting the entire lower bound and improving the overall log-likelihood.
3. Gaussian VAEs
The most common instantiation of the VAE framework is the Gaussian VAE, where the encoder, decoder and prior are modeled as Gaussians.
Figure 4: Overview of the Gaussian VAE. Each input $x$ is encoded into a class-conditional Gaussian $q_\phi(z|x)$ (colored clusters). The aggregate posterior is matched to the isotropic prior $p(z) = \mathcal{N}(0, I)$ via the KL term in the ELBO. Samples from $q_\phi(z|x)$ are decoded by $p_\theta(x|z)$ to produce reconstructions $\hat{x}$, whose marginal approximates the data distribution.
3.1 The encoder part
For each input $x$, the encoder produces a Gaussian distribution centered at $\mu_\phi(x)$ with variance $\sigma^2_\phi(x)$, so that similar inputs yield overlapping distributions in the latent space:

$$q_\phi(z|x) = \mathcal{N}\left(z;\, \mu_\phi(x),\, \sigma^2_\phi(x)\, I\right), \qquad z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
This is the reparameterization trick: by expressing $z$ as a deterministic function of $(\mu_\phi(x), \sigma_\phi(x))$ and a fixed noise variable $\epsilon$, the stochasticity is separated from the parameters, making the sampling step differentiable and allowing gradients to flow back through $z$ to the encoder.
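A minimal sketch of the trick, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -2.0])     # encoder mean for some input x
sigma = np.array([0.5, 1.5])   # encoder std for the same input

# Reparameterization: the stochasticity lives in eps, not in (mu, sigma).
eps = rng.normal(size=(100_000, 2))  # eps ~ N(0, I), independent of parameters
z = mu + sigma * eps                 # deterministic in (mu, sigma) given eps

print(z.mean(axis=0))  # ~ mu
print(z.std(axis=0))   # ~ sigma
```

Because $z$ is a deterministic function of $(\mu, \sigma)$ for a fixed $\epsilon$, we have $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$, so gradients of the reconstruction loss can flow through the sampling step to the encoder parameters.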
Since the prior $p(z) = \mathcal{N}(0, I)$ is also Gaussian, the KL divergence between the two admits a closed-form solution - no numerical integration required:

$$D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$

where $d$ is the dimensionality of the latent space.
Derivation of closed-form KL loss
Since both $q_\phi(z|x)$ and $p(z)$ are diagonal, the KL factorizes over dimensions. It suffices to derive the KL for a single scalar dimension, $\mathcal{N}(\mu, \sigma^2)$ vs. $\mathcal{N}(0, 1)$:

$$D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\right) = \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}\left[\log \frac{\mathcal{N}(z; \mu, \sigma^2)}{\mathcal{N}(z; 0, 1)}\right] = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)$$

Summing over all independent dimensions:

$$D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
Taking the gradient of the KL term with respect to $\mu_j$ and $\sigma_j$, we have:

$$\frac{\partial D_{KL}}{\partial \mu_j} = \mu_j, \qquad \frac{\partial D_{KL}}{\partial \sigma_j} = \sigma_j - \frac{1}{\sigma_j}$$

Setting these to zero gives $\mu_j = 0$ and $\sigma_j = 1$. Therefore minimizing the KL term alone pushes the encoder toward $q_\phi(z|x) = \mathcal{N}(0, I)$ for every input - a latent code that ignores $x$ entirely.
This is why the reconstruction term is essential: it pulls $\mu_\phi(x)$ away from zero and $\sigma_\phi(x)$ toward smaller values, to make $z$ informative about $x$.
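The closed-form expression can be verified against a direct Monte Carlo estimate of the KL for one latent dimension (the values of $\mu$ and $\sigma$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma = 0.7, 0.4  # one latent dimension of q(z|x) = N(mu, sigma^2)

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) from the derivation above.
kl_closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1)

# Monte Carlo estimate: E_{z~q}[log q(z) - log p(z)].
z = rng.normal(mu, sigma, size=1_000_000)
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)
log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates agree
```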
3.2 The Decoder part
To counteract collapse from the regularization term, the reconstruction term enforces that $z$ remains informative about $x$. Specifically, the decoder is trained to output a sample $\hat{x}$ that resembles the original input as closely as possible, given a latent vector drawn from the encoder's posterior $q_\phi(z|x)$. Note that $\hat{x}$ need not be identical to $x$:

$$p_\theta(x|z) = \mathcal{N}\left(x;\, \mu_\theta(z),\, \sigma^2 I\right)$$
Here $\mu_\theta(z)$ is the output of a neural network decoder, and $\sigma^2$ is a fixed hyperparameter controlling the spread of the output distribution - a large $\sigma^2$ allows more deviation from the input, while a small $\sigma^2$ forces the reconstruction to stay close to the input $x$. The reconstruction loss can now be rewritten as:

$$\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] = -\frac{1}{2\sigma^2}\, \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right] + \text{const}$$
This is equivalent to minimizing the expected MSE between the input $x$ and the decoder output $\mu_\theta(z)$ - essentially the original AE loss, now averaged over latent samples.
3.3 Overall Training Procedure
With both the encoder and decoder defined, the full training procedure follows directly from maximizing the ELBO. Each training step processes a minibatch of inputs: encode each $x$ into $(\mu_\phi(x), \sigma_\phi(x))$, sample $z$ via the reparameterization trick, decode to $\mu_\theta(z)$, and minimize the sum of the MSE reconstruction term and the closed-form KL term by gradient descent.
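A minimal sketch of the per-minibatch loss computation, with single linear layers standing in for the real encoder/decoder networks (all names, sizes, and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

d_in, d_z = 6, 2

# Hypothetical toy weights standing in for encoder/decoder networks.
W_mu = rng.normal(size=(d_in, d_z)) * 0.1
W_logvar = rng.normal(size=(d_in, d_z)) * 0.1
W_dec = rng.normal(size=(d_z, d_in)) * 0.1

def vae_loss(x_batch):
    """One forward pass of the (negative) ELBO for a minibatch."""
    # 1. Encode: predict the parameters of q(z|x).
    mu = x_batch @ W_mu
    logvar = x_batch @ W_logvar
    # 2. Sample z with the reparameterization trick.
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # 3. Decode and compute the reconstruction (MSE) term.
    x_hat = z @ W_dec
    recon = np.mean(np.sum((x_batch - x_hat) ** 2, axis=1))
    # 4. Closed-form KL term against the N(0, I) prior.
    kl = np.mean(0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1, axis=1))
    return recon + kl  # minimize the negative ELBO (up to constants)

batch = rng.normal(size=(32, d_in))
loss = vae_loss(batch)
print(loss)
```

In a real implementation the three linear maps would be deep networks and the gradient of this scalar loss would be backpropagated through both the decoder and, via the reparameterized $z$, the encoder.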
4. Drawbacks of Gaussian VAEs
Despite its elegance, the Gaussian VAE has several well-known limitations:
4.1 Blurry reconstructions
Modeling $p_\theta(x|z)$ as a Gaussian with fixed variance corresponds to minimizing MSE, which tends to average over the multiple plausible reconstructions consistent with a sampled code $z$.
Proof of blurry reconstructions in VAEs
Recall the per-sample reconstruction loss:

$$\mathcal{L}_{rec}(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right]$$

When training over the full dataset, we optimize its expectation over all $x \sim p(x)$:

$$\mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right] = \mathbb{E}_{z \sim q_\phi(z)}\, \mathbb{E}_{x \sim q_\phi(x|z)}\left[\|x - \mu_\theta(z)\|^2\right]$$

Since $\mu_\theta(z)$ only appears in the inner expectation and has no effect on the aggregate posterior $q_\phi(z)$, the outer expectation acts as a constant weight. It suffices to minimize the inner term with respect to $\mu_\theta(z)$ for each fixed $z$. Taking the gradient and setting it to zero:

$$\mu_\theta^*(z) = \mathbb{E}_{x \sim q_\phi(x|z)}\left[x\right]$$
The optimal decoder output is the conditional mean of $x$ given $z$ under the encoder's inverse distribution $q_\phi(x|z)$. When multiple distinct images map to similar latent codes $z$, the MSE loss forces the decoder to output their average - producing blurry reconstructions.
For image data, this produces blurry outputs rather than sharp, realistic samples.
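The averaging effect is easy to see numerically. Assuming two hypothetical inputs that happen to share a latent code, the MSE-optimal decoder output is their mean, not either sharp input:

```python
import numpy as np

# Two distinct "images" (flattened) that the encoder maps to the same z.
x1 = np.array([1.0, 0.0, 1.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])

def expected_mse(x_hat):
    """Reconstruction loss when z is equally likely to have come from x1 or x2."""
    return 0.5 * np.sum((x1 - x_hat) ** 2) + 0.5 * np.sum((x2 - x_hat) ** 2)

# The MSE-optimal output is the conditional mean -- the blurry average --
# which beats committing to either sharp image.
blurry = (x1 + x2) / 2
print(expected_mse(blurry), expected_mse(x1), expected_mse(x2))  # 1.0 2.0 2.0
```

The average `[0.5, 0.5, 0.5, 0.5]` looks like neither original: this is exactly the washed-out, low-contrast character of VAE samples on image data.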
So what can be done to improve this? One natural idea is to make the encoder or decoder more expressive by adding more layers to capture richer, more complex latent structure. However, increasing depth alone does not solve the limited posterior expressiveness problem of the encoder. Moreover, when the decoder grows too powerful, it can reconstruct $x$ without relying on $z$ at all, triggering posterior collapse.
4.2 Limited posterior expressiveness
The diagonal Gaussian assumption for $q_\phi(z|x)$ restricts the variational family to axis-aligned ellipsoids (i.e., zero off-diagonal covariance). If the true posterior has complex, multimodal, or highly correlated structure, a single diagonal Gaussian cannot capture it - leading to a persistently loose ELBO bound regardless of encoder capacity.
4.3 Posterior Collapse
Posterior collapse occurs when the decoder learns to ignore the latent code entirely. Why does this happen? One key reason is that when the decoder becomes too expressive (a sufficiently deep neural network), at some point during training it finds it easier to simply approximate the data distribution directly:

$$p_\theta(x|z) \approx p_\theta(x) \quad \text{for all } z$$
This sounds like a win for reconstruction quality, but it also breaks the encoder. The decoder has learned to ignore $z$: the VAE's output now barely changes no matter what latent code it receives. Recall the ELBO:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$
When the decoder is powerful enough to reconstruct $x$ without using $z$, by exploiting its own internal structure, the reconstruction term becomes approximately constant with respect to $q_\phi$. The gradient signal that would normally force the encoder to encode information into $z$ disappears. The optimizer then finds the path of least resistance: collapse the KL term to zero by driving $q_\phi(z|x) \to p(z)$. At this point, $x$ and $z$ become statistically independent, the latent code carries no information about the input, and the decoder can no longer be controlled through it. The VAE reduces to a decoder-only model, and the ELBO degenerates to:

$$\mathcal{L}_{ELBO} \approx \mathbb{E}_{z \sim p(z)}\left[\log p_\theta(x)\right] - 0 = \log p_\theta(x)$$
Proof of the independence of $x$ and $z$ after posterior collapse
We can rewrite the regularization term, averaged over all $x \sim p(x)$, as:

$$\mathbb{E}_{x \sim p(x)}\left[D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)\right] = I_{q_\phi}(x; z) + D_{KL}\left(q_\phi(z) \,\|\, p(z)\right)$$

where $q_\phi(z) = \mathbb{E}_{x \sim p(x)}\left[q_\phi(z|x)\right]$ is the aggregate posterior and $I_{q_\phi}(x; z)$ is the mutual information between $x$ and $z$ under the joint $p(x)\, q_\phi(z|x)$. This decomposition reveals what posterior collapse actually destroys. When the regularization term is driven to zero, both components - each non-negative - must vanish simultaneously:
- $I_{q_\phi}(x; z) = 0$: The latent code $z$ becomes statistically independent of $x$, i.e. $q_\phi(z|x) = q_\phi(z)$.
- $D_{KL}\left(q_\phi(z) \,\|\, p(z)\right) = 0$: The aggregate posterior collapses to the isotropic Gaussian $\mathcal{N}(0, I)$. The latent space loses all class-specific structure; the per-class mixture components that distinguish different types of $x$ are wiped out.
4.4 Mismatch between aggregate posterior and prior
Another weakness of standard VAEs is that even if each individual posterior $q_\phi(z|x)$ is close to the prior, the aggregated posterior $q_\phi(z)$ may still mismatch $p(z)$ in some regions. This mismatch creates "holes" in the latent space - regions with high prior probability but low aggregate posterior density - causing poor sample quality at generation time.
5. Conclusion
The Variational Autoencoder is a foundational generative model that elegantly combines probabilistic inference with deep learning. By replacing the deterministic bottleneck of a standard autoencoder with a learned posterior distribution, VAEs endow the latent space with a structured, continuous geometry that supports both generation and interpolation. The ELBO provides a tractable training objective that simultaneously encourages faithful reconstruction and regularizes the latent space toward a simple prior - a tension that lies at the heart of all latent-variable generative models.
That said, the Gaussian VAE is far from perfect. The four drawbacks discussed above - blurry reconstructions from the MSE objective, posterior collapse, limited posterior expressiveness from the diagonal Gaussian assumption, and aggregate posterior mismatch - are not merely implementation details; they are fundamental limitations that arise from the design choices made to keep the ELBO tractable.
What comes next? Two important lines of work build directly on these observations to solve standard VAE’s limitations:
- Hierarchical VAEs (HVAEs) address the expressiveness problem by stacking multiple layers of stochastic latent variables. Rather than compressing $x$ into a single $z$, HVAEs learn a hierarchy $z_1, \dots, z_L$ where each layer captures structure at a different level of abstraction. This allows the model to represent far richer posteriors, and the ELBO generalizes naturally to the hierarchical setting.
- Denoising Diffusion Probabilistic Models (DDPMs) take a different philosophical path. Instead of learning a compact latent code, diffusion models define a fixed forward process that gradually corrupts data with Gaussian noise over $T$ steps, then learn to reverse this process step by step. Remarkably, this can be seen as a special case of a hierarchical latent-variable model where the encoder is fixed (the forward noising process) and only the decoder (the denoising network) is learned. This design sidesteps the blurry-reconstruction and posterior-collapse problems entirely - the fixed encoder cannot collapse, and the step-by-step denoising objective enforces sharp, high-frequency detail at each scale.
References
[1] Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes.
[2] The Principles of Diffusion Models. https://the-principles-of-diffusion-models.github.io/
[3] Wikipedia contributors. Jensen's inequality. https://en.wikipedia.org/wiki/Jensen%27s_inequality
[4] Wikipedia contributors. Monte Carlo method. https://en.wikipedia.org/wiki/Monte_Carlo_method
[5] Wikipedia contributors. Reparameterization trick. https://en.wikipedia.org/wiki/Reparameterization_trick
Notations
See the notation reference for a summary of symbols used across all notes.
AI Acknowledgement
And yes, I had AI help with wording and structure (Claude by Anthropic) (•‿•)