Your EMA decay is a hyperparameter, not a convention

Almost every modern diffusion-model paper uses an exponential moving average of the weights and reports numbers from the EMA checkpoint. Almost none of them tune the EMA decay — they just inherit 0.9999 from whichever paper they were forked from. We dug into this default and found that it's quietly trading recall for precision, and that the "best" EMA depends heavily on which training stage you're at.

TL;DR. EMA decay isn't just a smoother. It's a knob that shifts the model along a precision–recall tradeoff. A larger decay sharpens samples but suppresses rare modes (a soft kind of mode collapse). The decay that's optimal at 80 epochs is not the one that's optimal at 800. The community-default 0.9999 is rarely the right choice, especially when methods are compared partway through training.

Why care about EMA decay at all?

Diffusion models are expensive, so most empirical comparisons happen before full convergence — at 80 epochs, 100k iterations, "results so far." In that regime, the choice of EMA decay isn't a finishing touch; it's part of the measurement instrument. If two methods get different decays "by convention," the ranking you read off the table partly reflects how each model interacted with that decay, not just the underlying training objective.

We benchmark three diffusion models that span the typical input-space spectrum — JiT [1] (pixel space), SiT [2] (latent space), and RAE [3] (representation space) — and for each one we run a full EMA sweep: {0.5, 0.9, 0.99, 0.999, 0.9993, 0.9995, 0.9997, 0.9999, 0.99999, 0.999999}, plus the raw online checkpoint, at multiple training stages. Same training pipeline, same sampling protocol; only the decay changes.
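
Concretely, the sweep just means keeping one EMA copy of the weights per decay value and updating every copy after each optimizer step, so a single training run yields all of the EMA variants. A minimal sketch of that bookkeeping (the EMASweep class and its names are illustrative, not our actual pipeline):

```python
import copy
import torch

DECAYS = [0.5, 0.9, 0.99, 0.999, 0.9993, 0.9995, 0.9997, 0.9999, 0.99999, 0.999999]

class EMASweep:
    """Maintain one EMA copy of a model per decay value."""
    def __init__(self, model, decays=DECAYS):
        self.decays = decays
        # one frozen copy of the model per decay value
        self.copies = {d: copy.deepcopy(model).eval() for d in decays}

    @torch.no_grad()
    def update(self, model):
        for d, ema_model in self.copies.items():
            for p_ema, p in zip(ema_model.parameters(), model.parameters()):
                # theta_ema <- d * theta_ema + (1 - d) * theta
                p_ema.mul_(d).add_(p, alpha=1 - d)

# Usage inside the training loop:
#   sweep = EMASweep(model)
#   ... after every optimizer.step(): sweep.update(model)
# At evaluation time, sample from each sweep.copies[decay] plus the raw model.
```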

Finding 1: EMA decay slides you along a precision–recall curve

Here's what the RAE sweep looks like at 80 epochs. Every row uses the same checkpoint — they only differ in which EMA decay we apply on top of it.

| EMA decay | FID [4] | IS [5] | Precision [6] | Recall ↑ |
|---|---|---|---|---|
| raw (no EMA) | 4.46 | 159.6 | 0.684 | 0.607 |
| 0.5 | 4.38 | 160.0 | 0.680 | 0.611 |
| 0.9 | 3.88 | 164.3 | 0.687 | 0.615 |
| 0.99 | 3.33 | 179.3 | 0.709 | 0.604 |
| 0.999 | 3.20 | 200.0 | 0.741 | 0.585 |
| 0.9993 | 3.24 | 204.5 | 0.744 | 0.586 |
| 0.9995 | 3.28 | 207.6 | 0.749 | 0.579 |
| 0.9997 | 3.38 | 212.3 | 0.758 | 0.569 |
| 0.9999 (community default) | 4.16 | 234.8 | 0.787 | 0.531 |
| 0.99999 | 444.78 | 1.23 | 0.000 | 0.000 |
| 0.999999 | 328.03 | 1.22 | 0.000 | 0.000 |

Best FID is the 0.999 row; 0.9999 is the community default; the last two rows are full collapse.

Three things jump out:

  1. The best-FID, best-precision, and best-recall settings are different EMA values. Recall peaks at 0.9, FID at 0.999, IS and precision at 0.9999. There is no single decay that wins on every metric.
  2. The community default of 0.9999 is not the FID-optimal choice at this training stage. It gives the highest precision and the highest IS, but FID is worse than every decay between 0.99 and 0.9997 — because recall has dropped from 0.61 to 0.53.
  3. Pushing the decay further (0.99999, 0.999999) is a cliff. The averaged model becomes so stale that the generative distribution falls off the data manifold entirely.

In other words: EMA isn't only smoothing optimization noise. It's also dialing the model along a fidelity–coverage tradeoff. Reporting only quality-flavored metrics (FID, IS, precision) systematically hides half of what EMA is doing.

Finding 2: EMA changes the ranking, not just the score

Because the precision–recall tradeoff is real, two models that look essentially tied under their raw checkpoints can separate once you apply different EMA decays, and vice versa. A method that looks "best" under 0.9999 may stop being best when you sweep the decay. We've seen rankings flip more than once across our experiments. So if a benchmark fixes EMA to 0.9999 for everyone, part of the resulting ordering is an EMA artefact, not an architecture or objective advantage.

Our practical recommendation is mild but firm: at minimum, papers should report the EMA decay they used. Ideally, comparisons should either tune EMA per method under a shared protocol, or include the raw-checkpoint numbers alongside EMA ones so the smoothing effect can be disentangled from the underlying method.

Finding 3: the optimal EMA depends on the training stage

The most consequential of our findings: there is no single EMA decay that is optimal across the whole training trajectory. Early on, when the online model is changing rapidly, a long EMA horizon is averaging over qualitatively different model states — it doesn't stabilise the current model so much as mix in stale behaviour. Later in training, once optimization is more stable, longer horizons start to behave like genuine smoothers and become helpful.

Empirically this means the metric-optimal EMA decay drifts as training proceeds, and the gap between "stage-aware decay" and "fixed 0.9999" is largest precisely at the partial-training stages where most empirical comparisons are made.

What does this look like? A 2D toy

To make the effect visible, we ran a controlled experiment on a 2D tree-structured Gaussian mixture (from autoguidance [7]). The ground truth is a hierarchical tree of branches; some branches are dense and "popular," others are sparse "tail" branches. We trained a small MLP score model, then compared the raw model with EMA versions at decays 0.99 / 0.997 / 0.998 / 0.999.
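
If you want to poke at the effect yourself, here is a self-contained stand-in for that experiment. It is deliberately simplified: a three-component 2D mixture with one dominant mode instead of the full tree, and a small flow-matching MLP rather than the score model we actually used, but the EMA bookkeeping is the same.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: one dense "trunk" mode and two sparse "tail" modes.
means = torch.tensor([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
weights = torch.tensor([0.90, 0.05, 0.05])

def sample_data(n):
    idx = torch.multinomial(weights, n, replacement=True)
    return means[idx] + 0.3 * torch.randn(n, 2)

# Small MLP velocity model: input is (x, t), output is a 2D velocity.
model = nn.Sequential(nn.Linear(3, 128), nn.SiLU(),
                      nn.Linear(128, 128), nn.SiLU(),
                      nn.Linear(128, 2))
ema_model = copy.deepcopy(model)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
decay = 0.998  # the EMA horizon under study

for step in range(5000):
    x1 = sample_data(256)
    x0 = torch.randn_like(x1)
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1            # linear interpolation path
    target = x1 - x0                      # flow-matching velocity target
    pred = model(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                 # EMA update after every step
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

@torch.no_grad()
def generate(net, n=2000, steps=100):
    x = torch.randn(n, 2)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + net(torch.cat([x, t], dim=1)) / steps   # Euler integration
    return x

raw_samples = generate(model)      # samples from the online weights
ema_samples = generate(ema_model)  # samples from the EMA weights
```

Plot raw_samples and ema_samples side by side to compare how well each covers the two sparse modes.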

Figure: soft mode collapse, visualised. Top: final samples overlaid on the true tree. Bottom: trajectories from a fixed seed grid, with tail-branch paths highlighted. As the EMA decay grows, samples drift toward the dominant branches and tail branches lose coverage. At decay 0.999 the averaged model becomes stale enough that trajectories leave the manifold entirely.

Zooming in on the raw model vs EMA 0.998:

Figure: raw-model samples vs EMA 0.998 samples on the tree distribution. Same seeds, same training run. EMA 0.998 produces samples that hug the main trunk and visibly under-cover the nearby tail branches, even though the underlying training is identical.

This isn't a hard collapse to a single mode. It's something subtler — a soft over-concentration toward the dominant branches, with rare branches gradually under-served. On big image benchmarks, the same mechanism shows up as cleaner-looking samples with lower recall: typical patterns get sharpened, rare configurations quietly disappear.

A mechanistic intuition

EMA averages weights over time, which means it preferentially retains structures that are stable over time. Frequent, high-density modes are reinforced; rare, fine-grained, still-evolving structures get attenuated. Pair that with a long horizon (large decay) and a non-stationary optimization trajectory, and you get soft over-concentration almost by construction.
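
To put a number on "long horizon": unrolling the standard EMA update shows it is a geometric average over past weights, with an effective window of roughly 1/(1 − β) steps, so about 1,000 steps at β = 0.999 and 10,000 at β = 0.9999.

```latex
% Standard EMA identity; beta is the decay, theta_t the online weights at step t.
\theta^{\mathrm{EMA}}_t
  \;=\; \beta\,\theta^{\mathrm{EMA}}_{t-1} + (1-\beta)\,\theta_t
  \;=\; (1-\beta)\sum_{k=0}^{t-1}\beta^{k}\,\theta_{t-k} \;+\; \beta^{t}\,\theta^{\mathrm{EMA}}_0
```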

Phrased that way, the precision–recall tradeoff and the soft mode collapse are two views of the same phenomenon, not two independent effects.

What about post-hoc EMA?

If the best decay drifts during training, the natural escape hatch is post-hoc EMA: keep dense checkpoints, then reconstruct any EMA horizon offline at evaluation time. We tried this. It's a useful approximation — but not a free one. Post-hoc reconstruction only has access to a discrete set of checkpoints; between them, the model's parameters are unknown. When the training trajectory is rapidly non-stationary (early training, learning rare structures), the approximation is noticeably worse than the corresponding online EMA. The accuracy of post-hoc EMA is, ironically, controlled by exactly the same non-stationarity that makes choosing online EMA difficult in the first place.
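
For concreteness, here is roughly what that offline reconstruction looks like under the simplest assumption, namely that each saved checkpoint stands in for all of the steps since the previous one (the load_checkpoint helper and the other names here are illustrative, not a specific library API):

```python
import torch

def posthoc_ema(checkpoint_paths, decay, save_every, load_checkpoint):
    """Approximate an online EMA with per-step `decay` using only the
    checkpoints in `checkpoint_paths` (ordered oldest -> newest)."""
    ema = None
    # Each saved checkpoint stands in for `save_every` consecutive steps,
    # so the per-checkpoint decay is decay ** save_every.
    block_decay = decay ** save_every
    for path in checkpoint_paths:
        state = load_checkpoint(path)            # dict of parameter tensors
        if ema is None:
            ema = {k: v.clone() for k, v in state.items()}
            continue
        for k in ema:
            ema[k].mul_(block_decay).add_(state[k], alpha=1 - block_decay)
    return ema

# The approximation error comes from treating the weights as constant between
# checkpoints; when the trajectory changes quickly (early training), that
# piecewise-constant assumption is exactly what breaks down.
```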

What we'd like to see going forward

  • Report EMA decay in diffusion-model papers, the way we report learning rate and batch size.
  • Sweep EMA per method when running a benchmark, or at the very least include raw-checkpoint numbers.
  • Pair quality metrics with coverage metrics. FID alone hides the recall side of the tradeoff.
  • Treat EMA as stage-dependent, not as a fixed convention inherited from upstream papers.

The paper has the full results, including the per-stage EMA sensitivity curves and the inter-model ranking flips. We'll link the arXiv version here once it's up.

References

  1. Li & He. Back to Basics: Let Denoising Generative Models Denoise. arXiv 2025. arXiv:2511.13720.
  2. Ma, Goldstein, Albergo, Boffi, Vanden-Eijnden, & Xie. SiT: Exploring Flow- and Diffusion-based Generative Models with Scalable Interpolant Transformers. ECCV 2024.
  3. Zheng, Ma, Tong, & Xie. Diffusion Transformers with Representation Autoencoders. arXiv 2025. arXiv:2510.11690.
  4. Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NeurIPS 2017. (FID.)
  5. Salimans, Goodfellow, Zaremba, Cheung, Radford, & Chen. Improved Techniques for Training GANs. NeurIPS 2016. (Inception Score.)
  6. Kynkäänniemi, Karras, Laine, Lehtinen, & Aila. Improved Precision and Recall Metric for Assessing Generative Models. NeurIPS 2019.
  7. Karras, Aittala, Kynkäänniemi, Lehtinen, Aila, & Laine. Guiding a Diffusion Model with a Bad Version of Itself. NeurIPS 2024.