Variational Lossy Autoencoder
Overview
The Variational Lossy Autoencoder (VLAE) by Chen, Kingma, Salimans, Duan, Dhariwal, Schulman, Sutskever, and Abbeel (2016) addresses the fundamental tension in VAE design: when a sufficiently powerful autoregressive decoder is used, the model can explain all the data through p(x_i | x_{<i}) alone, leaving the latent code z completely unused -- a failure mode known as posterior collapse.
The key insight is to use the bits-back argument from information theory to understand what z must encode. When the decoder is autoregressive (e.g., PixelRNN/CNN, WaveNet, MADE), it can model local spatial correlations in x from context alone. The latent code z is only needed for information that the autoregressive decoder cannot infer from local context -- i.e., global structure like object identity, style, and layout. By restricting the decoder's receptive field (for example, using a PixelCNN that only conditions on spatially local context rather than the full preceding sequence), the model is forced to route non-local, global information through z.
This creates a local/global information decomposition: the autoregressive decoder handles local texture and fine detail, while z captures global semantics. The KL term in the ELBO controls how many bits flow through z, making this a principled rate-distortion framework -- hence the name "lossy autoencoder." The latent variables remain continuous Gaussian variables as in a standard VAE; there is no discrete quantization.
VLAE achieved state-of-the-art results on MNIST, OMNIGLOT, and Caltech-101 Silhouettes density estimation at the time of publication, and also reports competitive bits-per-dimension on CIFAR-10.
Key Contributions
- Lossy compression interpretation of VAEs: Reframes the ELBO as a rate-distortion objective where KL divergence controls rate (bits through z) and reconstruction loss controls distortion, giving a principled account of what z should encode
- Bits-back / information preference argument: Formally shows that when a decoder can model local context autoregressively, it will prefer to do so; z is only used for global information the decoder cannot recover from local context -- this determines what a VAE with an AR decoder will learn to encode in z
- Local/global information decomposition: By choosing the autoregressive decoder's receptive field (local vs. global), the practitioner controls the information split: a local receptive field forces global info into z; a global receptive field leads to posterior collapse
- Autoregressive prior p(z): In addition to an autoregressive decoder, VLAE also explores using an autoregressive model as the prior p(z), enabling richer latent structure than a factored Gaussian prior
- Continuous latent variables: VLAE uses standard continuous Gaussian latents with the reparameterization trick -- it is not a discrete/VQ approach; the bottleneck is controlled through the KL weight and decoder expressiveness, not quantization
Architecture / Method
┌──────────────────────────────────────────────────────────┐
│ Variational Lossy Autoencoder │
│ │
│ Input x │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Encoder │ CNN / RNN │
│ │ q(z | x) │ → μ, σ (continuous Gaussian) │
│ └──────┬───────┘ │
│ │ z ~ N(μ, σ²) (reparameterization trick) │
│ ▼ │
│ ┌──────────────┐ │
│ │ Latent z │ ◄── KL(q(z|x) || p(z)) │
│ │ (continuous) │ controls rate (bits through z) │
│ └──────┬───────┘ │
│ │ Global info: what AR decoder cannot infer │
│ │ from local context (identity, style, layout) │
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ Autoregressive Decoder │ │
│ │ p(x_i | x_{<i, local}, z) │ │
│ │ ┌────────────────────────┐ │ │
│ │ │ PixelCNN / WaveNet │ │ │
│ │ │ Limited receptive │ │ │
│ │ │ field (local only) │ │ │
│ │ │ + Global cond. on z │ │ │
│ │ └────────────────────────┘ │ │
│ └──────────────┬───────────────────┘ │
│ ▼ │
│ Reconstruction x_hat │
│ │
│ Info split: z = global structure (forced by local AR) │
│ AR decoder = local texture/detail │
│ ELBO: E_q[log p(x|z)] - KL(q(z|x) || p(z)) │
└──────────────────────────────────────────────────────────┘
The encoder is a CNN that maps input x to the parameters (mean and variance) of a Gaussian posterior q(z|x). Sampling uses the standard reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I). There is no discrete quantization.
The latent bottleneck z is a continuous Gaussian code. The KL divergence KL(q(z|x) || p(z)) penalizes the amount of information flowing through z. The paper also experiments with an autoregressive prior p(z) (rather than a factored Gaussian), which allows richer structure in the latent space.
The autoregressive decoder models p(x|z) = product_i p(x_i | x_{<i, local}, z). The crucial design choice is the receptive field: if the AR decoder can see all previous pixels globally, it can reconstruct x without z (posterior collapse). By restricting the decoder to only local context (nearby pixels), global structure cannot be inferred from context alone and must flow through z. The latent z is injected as a global conditioning signal (e.g., additive bias to all layers).
The training objective is the standard VAE ELBO: E_q[log p(x|z)] - KL(q(z|x) || p(z)). No beta-weighting or modified loss is required -- the information allocation emerges naturally from the decoder's architectural constraints.
Results
- State-of-the-art density estimation: Achieves best reported bits-per-dimension on MNIST, OMNIGLOT, and Caltech-101 Silhouettes at time of publication
- Competitive on CIFAR-10: Reports competitive bits-per-dim on CIFAR-10, matching or exceeding pure autoregressive baselines while providing a structured latent space
- Posterior collapse resolved by design: Restricting decoder receptive field ensures z is actively used; ablations show that using a global AR decoder causes collapse while a local AR decoder does not
- Information decomposition validated: Latent code captures global attributes (object class, identity) while the AR decoder captures local texture; removing z degrades global coherence while local quality is maintained
- Autoregressive prior improves results: Using an AR model for p(z) (instead of factored Gaussian) yields further improvements in log-likelihood
Limitations & Open Questions
- The choice of decoder receptive field is a key architectural hyperparameter that is task-specific; too large collapses z, too small produces poor local quality
- Using an autoregressive decoder makes generation slow (sequential sampling), creating a speed/quality tradeoff compared to VAEs with factored decoders
- The framework focuses on density estimation and does not directly address disentanglement or controllable generation, which require additional constraints beyond the ELBO