In many generative modeling courses, the starting point for training a neural network to generate new, realistic data is the Variational Autoencoder (VAE). This model has its origins in the Autoencoder (AE), which serves a different purpose: reconstructing its input.

Formally, an AE consists of two parts: an encoder that compresses the input $x$ into a compact latent representation $z = f_\phi(x)$, and a decoder that reconstructs the input from that representation, $\hat{x} = g_\theta(z)$. The network is trained end-to-end by minimizing a reconstruction loss, typically the mean squared error:

$$\mathcal{L}_{\text{AE}} = \|x - \hat{x}\|^2 = \|x - g_\theta(f_\phi(x))\|^2$$

The bottleneck forces the encoder to learn a compressed, meaningful representation of the data. Once trained, the latent space can be used for tasks like dimensionality reduction or feature extraction. However, autoencoders have a critical limitation as generative models: the latent space has no guaranteed structure. Points in latent space are not organized in any principled way, so randomly sampling an arbitrary $z$ and decoding it often yields garbage. There is no way to smoothly interpolate between examples or generate novel, realistic samples.
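
As a minimal sketch of the AE objective (NumPy, with linear maps standing in for the neural networks; the names `encode` and `decode` are illustrative, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "networks": a linear encoder to a 2-D latent and a linear decoder back to 8-D.
W_enc = rng.normal(size=(2, 8))   # encoder weights (8-D input -> 2-D latent)
W_dec = rng.normal(size=(8, 2))   # decoder weights (2-D latent -> 8-D output)

def encode(x):
    return W_enc @ x              # compact latent representation z

def decode(z):
    return W_dec @ z              # reconstruction x_hat

x = rng.normal(size=8)
x_hat = decode(encode(x))

# Reconstruction loss: mean squared error between input and reconstruction.
loss = np.mean((x - x_hat) ** 2)
```

Nothing here constrains where `encode` places its inputs in latent space, which is exactly the limitation described above.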

The Variational Autoencoder (VAE), introduced by Kingma & Welling (2013), addresses this by imposing a probabilistic structure on the latent space. Instead of mapping $x$ to a fixed point $z$, the encoder outputs the parameters of a distribution $q_\phi(z|x)$ (usually a Gaussian). A latent vector $z$ is then sampled from this distribution rather than deterministically computed. The decoder learns to reconstruct $x$ from these sampled latents. A prior $p(z)$ is placed over the latent space, and the encoder is regularized to stay close to this prior via the KL divergence $D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))$. This shift from a deterministic bottleneck to a learned posterior gives the latent space two important properties:

  • Continuity — Nearby points in latent space decode to similar outputs. Because the encoder maps each input to a distribution over $z$ rather than a single point, similar inputs naturally produce overlapping distributions — and thus neighboring regions in latent space correspond to similar decoded outputs.

  • Completeness — Any point sampled from the prior produces a meaningful output. By regularizing the encoder’s posterior to stay close to the prior $p(z)$, the model ensures that the high-probability regions of the latent space are densely covered with meaningful structure, so random samples from the prior reliably decode into coherent outputs.

1. Construction of the VAE

Suppose we have a dataset of samples $\{x^{(i)}\}_{i=1}^{N}$ drawn i.i.d. from an unknown distribution $p^*(x)$. Since the true form of $p^*$ is unknown, we cannot sample from it directly. The goal of a generative model is to learn a tractable approximation $p_\theta(x)$ from this finite dataset by minimizing a divergence between the two distributions. In the case of VAEs, this divergence is the KL divergence:

$$D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)) = \mathbb{E}_{x \sim p^*}\left[\log \frac{p^*(x)}{p_\theta(x)}\right]$$

Once the optimal parameters $\theta^*$ are found, $p_{\theta^*}(x)$ can serve as a proxy for $p^*(x)$, enabling two key capabilities:

  • Generation: Draw new, realistic samples $x \sim p_{\theta^*}(x)$, e.g. via Monte Carlo sampling.
  • Evaluation: Assess how likely a given sample is under the learned distribution — for instance, judging whether an image looks realistic by computing the likelihood $p_{\theta^*}(x)$.

Now that we have the target to optimize, we can rewrite the KL divergence as follows:

$$D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)) = \underbrace{\mathbb{E}_{x \sim p^*}[\log p^*(x)]}_{\text{constant w.r.t. } \theta} - \mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$$

The constant is simply the negative entropy of $p^*$ and is independent of $\theta$. This is very convenient, since $p^*$ is unknown: minimizing $D_{\mathrm{KL}}(p^* \,\|\, p_\theta)$ is equivalent to maximizing the expected log-likelihood of the data under $p_\theta$:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim p^*}\left[\log p_\theta(x)\right]$$

This is precisely the maximum likelihood estimation (MLE) objective. In practice we replace the population expectation with its Monte Carlo estimate over the dataset, so the empirical MLE objective becomes:

$$\theta^* = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x^{(i)}\right)$$

where $N$ is the number of samples in the dataset. This objective is then optimized via SGD over minibatches.
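
To make the empirical objective concrete, here is a sketch (NumPy; the unit-variance 1-D Gaussian model and all variable names are illustrative assumptions) showing that the Monte Carlo MLE estimate is maximized near the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=5000)  # samples from the unknown p*(x)

def avg_log_likelihood(theta, x):
    # Empirical MLE objective: (1/N) * sum_i log p_theta(x_i)
    # for the model p_theta = N(theta, 1).
    return np.mean(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Scan candidate parameters and pick the maximizer of the empirical objective.
thetas = np.linspace(0.0, 6.0, 601)
objective = [avg_log_likelihood(t, data) for t in thetas]
theta_hat = thetas[int(np.argmax(objective))]
# theta_hat lands at (the grid point nearest) the sample mean, the MLE here.
```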

1.1 Decoder (Generator)

Returning to the autoencoder setting, the goal is to generate a new sample $x$ from a latent variable $z$ via a neural network decoder $p_\theta(x|z)$. We can express the target distribution as the marginal distribution:

$$p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$$

Unfortunately, directly optimizing this objective via MLE is intractable: it requires integrating over the entire high-dimensional latent space, and since $p_\theta(x|z)$ is a deep, expressive neural network with no closed-form solution, evaluating this integral exactly is computationally infeasible. To make the optimization tractable, we need a way to focus only on latent states that are likely to have generated the current input $x$, rather than integrating over the entire latent space.

1.2 Encoder (Inference Model)

We can reframe the problem: instead of integrating over all possible $z$, can we identify which latent states are most likely to have produced the observed sample $x$? This leads us to consider the posterior distribution $p_\theta(z|x)$, which by Bayes’ rule is:

$$p_\theta(z|x) = \frac{p_\theta(x|z)\, p(z)}{p_\theta(x)}$$

However, computing this posterior directly is equally intractable, as the denominator $p_\theta(x)$ is the same marginal likelihood we started with. This motivates approximating the true posterior with a learned inference model:

$$q_\phi(z|x) \approx p_\theta(z|x)$$

And yes, this is exactly the encoder of the VAE! It can be trained to concentrate probability mass on the latent states most relevant to $x$.
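
A small numerical sketch of why focusing on likely latents helps (NumPy; the 1-D latent, the specific distributions, and the proposal standing in for a trained encoder are all illustrative assumptions). Estimating $p(x) = \mathbb{E}_{p(z)}[p(x|z)]$ with prior samples wastes most of them, while sampling from a proposal concentrated near the posterior and reweighting (importance sampling) is far more efficient:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.1          # decoder noise: p(x|z) = N(z, s^2)
x_obs = 2.5      # observed sample; prior p(z) = N(0, 1)

def normal_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Exact marginal: integrating N(x; z, s^2) against N(z; 0, 1) gives N(x; 0, 1 + s^2).
exact = normal_pdf(x_obs, 0.0, np.sqrt(1 + s ** 2))

# Naive Monte Carlo: sample z from the prior. Most samples fall far from the
# region z ~ x_obs where p(x|z) is non-negligible, so the estimate is noisy.
z_prior = rng.normal(size=10_000)
naive = np.mean(normal_pdf(x_obs, z_prior, s))

# Importance sampling: draw z from a proposal concentrated near the posterior
# (which peaks around x_obs) and reweight each draw by p(z) / q(z).
q_mean, q_std = x_obs, s   # stands in for a trained encoder q(z|x)
z_q = rng.normal(q_mean, q_std, size=10_000)
weights = normal_pdf(z_q, 0.0, 1.0) / normal_pdf(z_q, q_mean, q_std)
importance = np.mean(normal_pdf(x_obs, z_q, s) * weights)
# "importance" tracks "exact" far more tightly than "naive" at the same budget.
```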

2. ELBO (Evidence Lower Bound)

Now that we have a controllable encoder model $q_\phi(z|x)$ to generate $z$, we can rewrite the MLE objective using it:

$$\log p_\theta(x) = \log \int p_\theta(x|z)\, p(z)\, dz = \log \mathbb{E}_{z \sim q_\phi(z|x)}\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right]$$

The learning objective is now tractable. Since $\log$ is concave, Jensen’s inequality lets us move it inside the expectation, giving the evidence lower bound (ELBO):

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \;=\; \mathcal{L}_{\text{ELBO}}(x; \theta, \phi)$$

Deriving further, we can see that $\mathcal{L}_{\text{ELBO}}$ consists of two terms:

$$\mathcal{L}_{\text{ELBO}} = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))}_{\text{regularization}}$$

  • Reconstruction term: This is the reconstruction objective from the standard AE, but now evaluated only over $z$ sampled from the encoder $q_\phi(z|x)$, making it tractable.

  • Regularization term: This penalizes the encoder’s posterior for deviating from the prior $p(z)$, enforcing the latent-space structure needed for generation.

Why is the learning objective tractable now?

The original MLE objective is intractable because it requires integrating $p_\theta(x|z)$ — a neural network — over the entire latent space. The ELBO resolves this in two key ways:

1. Replacing the integral with a tractable expectation: Instead of integrating over all $z$, the reconstruction term only requires sampling from the encoder $q_\phi(z|x)$, which concentrates mass on the latent regions most relevant to $x$.

2. A closed-form KL term: $q_\phi(z|x)$ is modeled as a simple distribution, typically a Gaussian, and the KL divergence between two Gaussians has a closed-form solution — no integration is needed at all. It is also easily trainable via the reparameterization trick.

Together, the two terms create a natural tension: maximizing $\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)]$ encourages the decoder to recover the original input as accurately as possible from latent samples (reconstruction term), while the regularization term pulls the encoder’s posterior back toward the prior $p(z)$. The VAE learns by striking a balance between these two competing objectives.

ELBO as a Divergence Bound

So what is the relationship between the ELBO and the true MLE goal $\mathbb{E}_{x \sim p^*}[\log p_\theta(x)]$? Recall that maximum likelihood training amounts to minimizing the KL divergence between $p^*(x)$ and the learned distribution $p_\theta(x)$:

$$\min_\theta \; D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))$$

Since this term is intractable in general, the variational framework of the VAE instead compares joint distributions over $(x, z)$. Specifically, consider the two joints

  • Generative joint — Decoder: $p_\theta(x, z) = p_\theta(x|z)\, p(z)$
  • Inference joint — Encoder: $q_\phi(x, z) = q_\phi(z|x)\, p^*(x)$

The total error in matching these joints decomposes as:

$$D_{\mathrm{KL}}(q_\phi(x, z) \,\|\, p_\theta(x, z)) = \underbrace{D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))}_{\text{generation error}} + \underbrace{\mathbb{E}_{x \sim p^*}\left[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))\right]}_{\text{inference error}}$$

Thus we have

$$D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)) \;\le\; D_{\mathrm{KL}}(q_\phi(x, z) \,\|\, p_\theta(x, z))$$

where equality holds when the inference error is zero, i.e. when the encoder perfectly models the unknown posterior distribution $p_\theta(z|x)$.

Note that $\log p_\theta(x)$ can also be rewritten in terms of the ELBO:

$$\log p_\theta(x) = \mathcal{L}_{\text{ELBO}}(x; \theta, \phi) + D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$$

We can see that the gap between the true log-likelihood and the ELBO is precisely the inference error for the current sample $x$. Maximizing the ELBO therefore directly reduces this gap. Specifically, optimizing the encoder tightens the bound by bringing the approximate posterior $q_\phi(z|x)$ closer to the true one $p_\theta(z|x)$, while optimizing the decoder pushes $\log p_\theta(x)$ itself upward — lifting the entire lower bound and improving the overall log-likelihood.

3. Gaussian VAEs

The most common instantiation of the VAE framework is the Gaussian VAE, where the encoder, decoder and prior are modeled as Gaussians.

3.1 The Encoder part

For each input $x$, the encoder produces a Gaussian distribution centered at $\mu_\phi(x)$ with variance $\sigma_\phi^2(x)$, so that similar inputs yield overlapping distributions in the latent space. Sampling from it is written as:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is the reparameterization trick: by expressing $z$ as a deterministic function of $(\mu_\phi(x), \sigma_\phi(x))$ and an independent noise variable $\epsilon$, the stochasticity is separated from the parameters, making the sampling step differentiable and allowing gradients to flow back through $z$ to the encoder.
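
A minimal sketch of the trick in NumPy (the names `mu` and `sigma` stand in for encoder outputs; the gradient point is that `z` is now a differentiable function of them, with all randomness isolated in `eps`):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])       # encoder output: posterior mean
sigma = np.array([0.5, 0.1])     # encoder output: posterior std

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
# z is deterministic in (mu, sigma); only eps is random.
eps = rng.normal(size=(100_000, 2))
z = mu + sigma * eps

# Empirically, z ~ N(mu, diag(sigma^2)), as required.
# And dz/dmu = 1, dz/dsigma = eps: gradients pass straight through.
```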

Since the prior $p(z) = \mathcal{N}(0, I)$ is also Gaussian, the KL divergence between the two admits a closed-form solution — no numerical integration required:

$$D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$

where $d$ is the dimension of the latent space.
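
The closed form is easy to verify numerically; a sketch (NumPy, with an arbitrary diagonal Gaussian posterior against the standard normal prior) comparing it to a Monte Carlo estimate of the same KL:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.8, 1.2, 0.5])

# Closed form: KL( N(mu, diag(sigma^2)) || N(0, I) )
#            = 1/2 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 )
kl_closed = 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)

# Monte Carlo check: KL = E_q[ log q(z) - log p(z) ] with z ~ q.
# The (2*pi) normalization constants cancel between log_q and log_p.
z = mu + sigma * rng.normal(size=(200_000, 3))
log_q = np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma), axis=1)
log_p = np.sum(-0.5 * z ** 2, axis=1)
kl_mc = np.mean(log_q - log_p)
```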

Taking the gradient of the KL term with respect to $\mu_j$ and $\sigma_j$, we have:

$$\frac{\partial D_{\mathrm{KL}}}{\partial \mu_j} = \mu_j, \qquad \frac{\partial D_{\mathrm{KL}}}{\partial \sigma_j} = \sigma_j - \frac{1}{\sigma_j}$$

Setting these to zero gives $\mu_j = 0$ and $\sigma_j = 1$. Therefore minimizing the KL term alone pushes the encoder toward the uninformative posterior:

$$q_\phi(z|x) = \mathcal{N}(0, I) \quad \text{for every } x$$

This is why the reconstruction term is essential: it pulls $\mu_\phi(x)$ away from zero and $\sigma_\phi(x)$ toward smaller values, making $z$ informative about $x$.
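
The collapse is easy to see numerically; a sketch (NumPy, arbitrary starting values) running plain gradient descent on the KL term alone, using the gradients derived above:

```python
import numpy as np

mu = np.array([2.0, -1.5])     # arbitrary starting encoder outputs
sigma = np.array([3.0, 0.2])

# Gradient descent on KL( N(mu, diag(sigma^2)) || N(0, I) ) only:
# dKL/dmu = mu,  dKL/dsigma = sigma - 1/sigma.
for _ in range(500):
    mu = mu - 0.05 * mu
    sigma = sigma - 0.05 * (sigma - 1.0 / sigma)

# With no reconstruction term, the posterior collapses to the prior:
# mu -> 0, sigma -> 1, so z carries no information about x.
```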

3.2 The Decoder part

To counteract collapse from the regularization term, the reconstruction term enforces that $z$ remains informative about $x$. Specifically, the decoder is trained to output a sample that resembles the original input as closely as possible, given a latent vector $z$ drawn from the encoder’s posterior $q_\phi(z|x)$. Note that the reconstruction need not be identical to $x$:

$$p_\theta(x|z) = \mathcal{N}\!\left(x;\, \mu_\theta(z),\, \sigma^2 I\right)$$

Here $\mu_\theta(z)$ is the output of a neural network decoder, and $\sigma^2$ is a fixed hyperparameter controlling the spread of the output distribution — a large $\sigma$ allows more deviation from the input, while a small $\sigma$ forces the reconstruction to stay close to the input $x$. The reconstruction loss can now be rewritten as:

$$-\log p_\theta(x|z) = \frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2 + \text{const}$$

This is equivalent to minimizing the expected MSE between the input $x$ and the decoder output $\mu_\theta(z)$ — essentially the original AE loss, but evaluated on latents sampled from the encoder.
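
A quick numerical check (NumPy; `x_hat` is a hypothetical decoder output) that the Gaussian negative log-likelihood is exactly a scaled MSE plus a constant independent of the decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                      # fixed decoder std hyperparameter
x = rng.normal(size=10)          # input
x_hat = x + rng.normal(scale=0.3, size=10)  # hypothetical decoder mean mu_theta(z)

# Direct negative log-likelihood of x under N(x_hat, sigma^2 I):
nll = -np.sum(-0.5 * ((x - x_hat) / sigma) ** 2
              - np.log(sigma) - 0.5 * np.log(2 * np.pi))

# Scaled MSE plus a constant that does not depend on x_hat:
mse_form = (np.sum((x - x_hat) ** 2) / (2 * sigma ** 2)
            + x.size * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))

# The two agree exactly, so minimizing -log p(x|z) over the decoder
# is the same as minimizing the (scaled) MSE.
```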

3.3 Overall Training Procedure

With both the encoder and decoder defined, the full training procedure follows directly from maximizing the ELBO. Each training step processes a minibatch of inputs:

  1. Encode each input $x$ to obtain $\mu_\phi(x)$ and $\sigma_\phi(x)$.
  2. Sample $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (reparameterization).
  3. Decode $z$ to obtain $\mu_\theta(z)$ and compute the reconstruction loss.
  4. Add the closed-form KL term and take a gradient step on the negative ELBO with respect to both $\theta$ and $\phi$.
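
The steps above can be sketched as one training step in PyTorch (the architecture, layer sizes, and class/function names are illustrative assumptions, not from the original text):

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)      # outputs mu_phi(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)  # outputs log sigma_phi^2(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # outputs mu_theta(z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    # Reconstruction term (Gaussian decoder with unit variance -> scaled MSE)
    rec = 0.5 * ((x - x_hat) ** 2).sum(dim=1)
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1)
    return (rec + kl).mean()

model = GaussianVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                 # stand-in minibatch of inputs
x_hat, mu, logvar = model(x)
loss = negative_elbo(x, x_hat, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```

Note that a single `backward()` call updates both $\theta$ (decoder) and $\phi$ (encoder), since the reparameterized sample keeps the whole computation differentiable.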

4. Drawbacks of Gaussian VAEs

Despite its elegance, the Gaussian VAE has several well-known limitations:

4.1 Blurry reconstructions

Modeling $p_\theta(x|z)$ as a Gaussian with a fixed variance corresponds to minimizing MSE, which tends to average over multiple plausible reconstructions.

Proof

Recall the per-sample reconstruction loss:

$$\mathcal{L}_{\text{rec}}(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right]$$

When training over the full dataset, we optimize its expectation over all $x \sim p^*(x)$. Swapping the order of expectation:

$$\mathbb{E}_{x \sim p^*}\,\mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_\theta(z)\|^2\right] = \mathbb{E}_{z \sim q(z)}\,\mathbb{E}_{x \sim q(x|z)}\left[\|x - \mu_\theta(z)\|^2\right]$$

where $q(z) = \mathbb{E}_{x \sim p^*}[q_\phi(z|x)]$ is the aggregate posterior and $q(x|z)$ the corresponding conditional.

Since $\mu_\theta(z)$ only appears in the inner expectation and has no effect on the aggregate posterior $q(z)$, the outer expectation acts as a constant weight. It suffices to minimize the inner term with respect to $\mu_\theta(z)$ for each fixed $z$. Taking the gradient and setting it to zero:

$$\nabla_{\mu_\theta(z)}\, \mathbb{E}_{x \sim q(x|z)}\left[\|x - \mu_\theta(z)\|^2\right] = 0 \;\Longrightarrow\; \mu_\theta^*(z) = \mathbb{E}_{x \sim q(x|z)}[x]$$

The optimal decoder output is the conditional mean of $x$ given $z$ under the encoder’s inverse distribution $q(x|z)$. When multiple distinct images map to similar latent codes $z$, the MSE loss forces the decoder to output their average — producing blurry reconstructions.
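
A tiny numerical illustration (NumPy; the two 4-pixel "images" are made up for the example): if two distinct inputs produce the same latent code, the MSE-optimal decoder output for that code is their average, not either input:

```python
import numpy as np

# Two distinct "images" that the encoder maps to (nearly) the same z.
x1 = np.array([0.0, 1.0, 0.0, 1.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])

# The decoder sees one z and must pick a single output c minimizing
# E[||x - c||^2] over x in {x1, x2} (equally likely).
candidates = [x1, x2, 0.5 * (x1 + x2)]
losses = [0.5 * np.sum((x1 - c) ** 2) + 0.5 * np.sum((x2 - c) ** 2)
          for c in candidates]

# The average wins: a gray blur rather than either sharp input.
best = candidates[int(np.argmin(losses))]
```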

For image data, this produces blurry outputs rather than sharp, realistic samples.

4.2 Limited posterior expressiveness

The diagonal Gaussian assumption for $q_\phi(z|x)$ restricts the approximate posterior to an axis-aligned ellipsoid (i.e., zero off-diagonal covariance). If the true posterior has complex, multimodal, or highly correlated structure, a single Gaussian cannot capture it — leading to a persistently loose ELBO bound regardless of encoder capacity.

4.3 Mismatch between aggregate posterior and prior

Even if each individual posterior is close to the prior, the aggregate posterior $q(z) = \mathbb{E}_{x \sim p^*}[q_\phi(z|x)]$ may not match $p(z)$. This mismatch creates “holes” in the latent space — regions with high prior probability but low posterior density — causing poor sample quality at generation time.

These limitations motivate more expressive extensions, such as Hierarchical VAEs, which stack multiple layers of latent variables to capture richer structure.

5. Conclusion

The Variational Autoencoder is a foundational generative model that elegantly combines probabilistic inference with deep learning. By replacing the deterministic bottleneck of a standard autoencoder with a learned posterior distribution, VAEs endow the latent space with a structured, continuous geometry that supports both generation and interpolation. The ELBO provides a tractable training objective that simultaneously encourages faithful reconstruction and regularizes the latent space toward a simple prior — a tension that lies at the heart of all latent-variable generative models.

That said, the Gaussian VAE is far from perfect. The three drawbacks discussed above — blurry reconstructions from the MSE objective, limited posterior expressiveness from the diagonal Gaussian assumption, and aggregate posterior mismatch — are not merely implementation details; they are fundamental limitations that arise from the design choices made to keep the ELBO tractable.

What comes next? Two important lines of work build directly on these observations:

  • Hierarchical VAEs (HVAEs) address the expressiveness problem by stacking multiple layers of stochastic latent variables. Rather than compressing $x$ into a single $z$, HVAEs learn a hierarchy $z_1, \dots, z_L$ where each layer captures structure at a different level of abstraction. This allows the model to represent far richer posteriors, and the ELBO generalizes naturally to the hierarchical setting.

  • Denoising Diffusion Probabilistic Models (DDPMs) take a different philosophical path. Instead of learning a compact latent code, diffusion models define a fixed forward process that gradually corrupts data with Gaussian noise over $T$ steps, then learn to reverse this process step by step. Remarkably, this can be seen as a special case of a hierarchical latent-variable model in which the encoder is fixed (the forward noising process) and only the decoder (the denoising network) is learned. This design sidesteps the blurry-reconstruction and posterior-collapse problems entirely — the fixed encoder cannot collapse, and the step-by-step denoising objective enforces sharp, high-frequency detail at each scale.


References

[1] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes.

[2] The Principles of Diffusion Models. https://the-principles-of-diffusion-models.github.io/

[3] Wikipedia: Jensen’s inequality. https://en.wikipedia.org/wiki/Jensen%27s_inequality

[4] Wikipedia: Monte Carlo method. https://en.wikipedia.org/wiki/Monte_Carlo_method

[5] Wikipedia: Reparameterization trick. https://en.wikipedia.org/wiki/Reparameterization_trick