Preprint · 2026

Taming Outlier Tokens in Diffusion Transformers

Xiaoyu Wu1* Yifei Wang1* Tsu-Jui Fu2 Liang-Chieh Chen2 Zhe Gan2 Chen Wei1
1Rice University    2Apple
*Equal contribution

TL;DR

Outlier tokens — a handful of high-norm patch tokens that absorb attention but carry little local information — are not just a ViT recognition phenomenon. They also appear in the encoders and the denoisers of modern RAE-style diffusion pipelines, and they hurt generation quality.

DSR introduces register tokens at both stages — the ViT-based tokenizer and the diffusion transformer. It cuts ImageNet-256 FID for RAE-DiT-XL (SigLIP2-B) from 5.89 → 4.58 at 80 epochs and raises GenEval on a large-scale text-to-image task from 0.426 → 0.466, with consistent gains across SiT, JiT, and RAE designs.

Where do outlier tokens hide in a DiT pipeline?

A modern RAE-style pipeline has two transformer stages — a ViT encoder that produces the latent representation, and a diffusion transformer that denoises in that space. We observe outlier tokens in both, but with distinct patterns.

Outlier tokens in the SigLIP2-B encoder
Outliers in the ViT encoder. Token-norm maps across layers of SigLIP2-B. Severe high-norm tokens emerge in the last few layers; the penultimate layer shows the strongest pattern, while the final output is somewhat more stable — likely a side-effect of SigLIP2's reconstruction-flavored objective.
Outlier tokens in the diffusion transformer across noise levels and layers
Outliers in the diffusion transformer. Norm maps of RAE-DiT (SigLIP2-B) across diffusion noise scales and DiT layers. Unlike standard ViTs, high-norm tokens concentrate in intermediate layers, and their severity drops as the noise level increases — suggesting that encoder anomalies are amplified by the denoising objective.
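The norm maps in both figures are straightforward to reproduce: take one layer's hidden states, compute the per-token L2 norm, and reshape to the patch grid. A minimal PyTorch sketch (the 16×16 grid and the median-based outlier rule are illustrative choices, not the paper's exact criterion):

```python
import torch

def token_norm_map(hidden_states: torch.Tensor, grid_size: int = 16) -> torch.Tensor:
    """Per-token L2 norms of one layer's hidden states, reshaped to the patch grid.

    hidden_states: (B, N, D) patch tokens from a ViT or DiT layer, with any
                   CLS / register tokens stripped beforehand (N = grid_size**2).
    returns:       (B, grid_size, grid_size) map, one norm per patch.
    """
    norms = hidden_states.norm(dim=-1)               # (B, N)
    return norms.reshape(-1, grid_size, grid_size)   # (B, H, W)

def outlier_fraction(hidden_states: torch.Tensor, k: float = 5.0) -> torch.Tensor:
    """Fraction of tokens whose norm exceeds k times the per-image median norm."""
    norms = hidden_states.norm(dim=-1)                      # (B, N)
    median = norms.median(dim=-1, keepdim=True).values      # (B, 1)
    return (norms > k * median).float().mean(dim=-1)        # (B,)
```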

Why masking the loss doesn't fix it

A natural hypothesis is that the issue is just a few extreme-loss tokens. We tested this with a token-level loss mask that drops tokens whose representation norms exceed a threshold. As shown below, masking does not improve generation. We take this as evidence that outliers are a symptom of corrupted local patch semantics, not the root cause.

| Training strategy | % tokens filtered | FID ↓ | IS ↑ | Prec ↑ | Rec ↑ |
|---|---|---|---|---|---|
| RAE-DiT-XL (SigLIP2-B) | 0% | 5.89 | 156.54 | 0.686 | 0.562 |
| + loss masking (τ = 100) | 0.1% | 6.06 | 152.72 | 0.686 | 0.562 |
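For reference, the loss-masking baseline above amounts to zeroing the diffusion loss on tokens whose target-latent norm exceeds a threshold and averaging over the surviving tokens. A minimal sketch, assuming a per-token loss of shape (B, N) and target latents of shape (B, N, D); only the τ = 100 threshold comes from the table, the rest is a plausible implementation:

```python
import torch

def masked_diffusion_loss(per_token_loss: torch.Tensor,
                          target_latents: torch.Tensor,
                          tau: float = 100.0) -> torch.Tensor:
    """Exclude high-norm tokens from the training loss.

    per_token_loss: (B, N) diffusion loss per patch token (e.g. MSE over channels)
    target_latents: (B, N, D) encoder latents the DiT is trained to denoise
    tau:            norm threshold; tokens with ||latent|| > tau are dropped
    """
    keep = (target_latents.norm(dim=-1) <= tau).float()        # (B, N), 1 = keep
    # Average only over surviving tokens; clamp guards against an all-masked batch.
    return (per_token_loss * keep).sum() / keep.sum().clamp(min=1.0)
```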

Dual-Stage Registers (DSR)

If outliers reflect degraded local patch semantics, then we want a mechanism that absorbs global, sink-like behavior without contaminating patch tokens. Register tokens do exactly that — and we apply them on both sides of the pipeline.

DSR framework overview
Dual-Stage Registers (DSR). We patch both sides of the diffusion pipeline with register tokens: a test-time register in the ViT encoder, and 36 trained registers in the diffusion transformer. Encoder and diffusion register outputs are discarded before downstream use, leaving only patch tokens.

Encoder-side registers

When trained registers are available (e.g. DINOv2 with registers), we use them directly. They visibly suppress high-norm artifacts and consistently improve downstream DiT generation:

Norm maps with and without trained DINOv2 registers
Trained encoder registers cleanly remove outliers. RAE-DiT (DINOv2-B) with and without trained registers, at a fixed diffusion timestep t = 0.5. Registers consistently suppress high-norm tokens and improve patch-level representations.

For encoders without trained registers (e.g. SigLIP2), we use test-time registers (TTR) — extra tokens appended at inference time. For SigLIP2-So400, where we find two distinct outlier populations, we apply TTR recursively: stabilize the encoder, then apply TTR again on the resulting representation.

| Strategy | FID ↓ | IS ↑ | Prec ↑ | Rec ↑ |
|---|---|---|---|---|
| RAE-DiT-XL (SigLIP2-B) | 5.89 | 156.54 | 0.686 | 0.562 |
| + test-time register | 4.63 | 177.20 | 0.748 | 0.542 |
| RAE-DiT-XL (SigLIP2-So400) | 7.04 | 167.01 | 0.682 | 0.515 |
| + test-time register | 6.66 | 166.88 | 0.687 | 0.527 |
| + recursive test-time register | 6.48 | 163.35 | 0.684 | 0.531 |
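Mechanically, a test-time register is just an extra token appended to the encoder's patch sequence at inference, whose output is thrown away. The sketch below illustrates that idea only; the zero initialization and the single-register default are assumptions rather than the paper's exact recipe, and the recursive variant simply re-applies the same wrapper to the stabilized representation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def encode_with_test_time_register(encoder: nn.Module,
                                   patch_tokens: torch.Tensor,
                                   num_registers: int = 1) -> torch.Tensor:
    """Append register tokens at inference and discard their outputs.

    encoder:      any transformer stack mapping (B, N, D) -> (B, N, D)
    patch_tokens: (B, N, D) embedded image patches
    returns:      (B, N, D) patch-token outputs; register outputs are dropped.
    """
    B, N, D = patch_tokens.shape
    registers = patch_tokens.new_zeros(B, num_registers, D)  # test-time init is a free choice
    out = encoder(torch.cat([patch_tokens, registers], dim=1))
    return out[:, :N]  # sink-like, high-norm behavior is absorbed by the extra tokens
```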

Diffusion-side registers

Even after fixing the encoder, the DiT itself still develops outlier tokens in its intermediate layers. We add a small number of trainable diffusion registers — learned jointly with the generator and discarded at inference. With them, internal outliers largely disappear and PCA structure becomes cleaner:

Norm maps for baseline, encoder-only DSR, and full DSR
Outliers are only fully suppressed when both sides are patched. Baseline vs. encoder-only test-time registers vs. full DSR (encoder + diffusion registers).
PCA maps for baseline, encoder-only DSR, and full DSR
PCA maps. Encoder-side TTR already produces a visible cleanup in PCA structure; adding diffusion registers brings a further small improvement.
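Wiring-wise, the diffusion registers are learnable tokens concatenated to the noisy latent sequence; they attend through every block and are dropped before the loss. Below is a minimal sketch around a generic (B, N, D) → (B, N, D) transformer stack; the placement, initialization scale, and omission of timestep/class conditioning are simplifications, while the count of 36 follows the overview figure.

```python
import torch
import torch.nn as nn

class DiTWithRegisters(nn.Module):
    """Diffusion transformer wrapper with trainable register tokens.

    Registers are learned jointly with the denoiser, participate in attention
    at every block, and are discarded so the loss sees only patch tokens.
    """
    def __init__(self, dit_blocks: nn.Module, dim: int, num_registers: int = 36):
        super().__init__()
        self.dit_blocks = dit_blocks  # any stack mapping (B, T, D) -> (B, T, D)
        self.registers = nn.Parameter(torch.randn(1, num_registers, dim) * 0.02)

    def forward(self, noisy_latents: torch.Tensor) -> torch.Tensor:
        B, N, _ = noisy_latents.shape
        regs = self.registers.expand(B, -1, -1)                    # (B, R, D)
        h = self.dit_blocks(torch.cat([noisy_latents, regs], dim=1))
        return h[:, :N]                                            # drop register outputs
```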

Results

Diffusion registers help across input spaces

Diffusion registers improve every input-space variant we tried — pixel-space (JiT), VAE latents (SiT), and multiple representation encoders. The gain is not tied to any particular tokenizer family.

| Method | FID ↓ | IS ↑ | Prec ↑ | Rec ↑ |
|---|---|---|---|---|
| RAE-DiT-XL (DINOv2-B, w/ encoder reg) | 4.11 | 226.44 | 0.775 | 0.529 |
| + diffusion reg | 3.92 | 226.92 | 0.773 | 0.542 |
| RAE-DiT-XL (SigLIP2-B) | 5.89 | 156.54 | 0.686 | 0.562 |
| + diffusion reg | 5.33 | 166.20 | 0.702 | 0.556 |
| RAE-DiT-XL (SigLIP2-B, w/ TTR) | 4.63 | 177.20 | 0.748 | 0.542 |
| + diffusion reg | 4.58 | 165.99 | 0.725 | 0.560 |
| VAE-SiT-XL | 16.05 | 70.11 | 0.550 | 0.647 |
| + diffusion reg | 14.47 | 78.50 | 0.554 | 0.651 |
| JiT-H | 30.34 | 22.34 | 0.424 | 0.621 |
| + diffusion reg | 23.14 | 26.36 | 0.475 | 0.611 |

ImageNet-256: class-conditional generation

Headline numbers on ImageNet-1K at 256×256. DSR consistently improves RAE-DiT and, when combined with the DDT head, reaches strongly competitive numbers at a fraction of the training budget.

| Method | Epochs | #Params | gFID ↓ (no CFG) | gFID ↓ (w/ CFG) |
|---|---|---|---|---|
| Pixel diffusion | | | | |
| ADM | 400 | 554M | 10.94 | 3.94 |
| JiT-H/16 | 600 | 953M | – | 1.86 |
| Latent diffusion (VAE) | | | | |
| SiT-XL | 1400 | 675M | 8.61 | 2.06 |
| REPA | 800 | 675M | 5.78 | 1.29 |
| REPA-E | 800 | 675M | 1.70 | 1.15 |
| RAE-DiTDH-XL (DINOv2-B) | 800 | 839M | 1.51 | 1.13 |
| Latent diffusion with multimodal encoder (SigLIP2-B) | | | | |
| RAE-DiT-XL | 80 | 676M | 5.89 | – |
| RAE-DiT-XL | 800 | 676M | 3.85 | 3.58 |
| RAE-DiT-XL + DSR | 80 | 676M | 4.58 | – |
| RAE-DiT-XL + DSR | 800 | 676M | 3.26 | 2.97 |
| RAE-DiTDH-XL | 800 | 839M | 2.91 | 2.77 |
| RAE-DiTDH-XL + DSR | 800 | 839M | 2.72 | 2.62 |
FID vs. training epoch — DSR reaches the baseline quality with 4× fewer epochs
DSR also trains faster. FID vs. epoch on ImageNet-256 — DSR reaches comparable quality with 4× fewer epochs than the RAE baseline.

Scaling across model sizes

DSR improves gFID at every DiT scale we tried, with only ~10% added GFLOPs.

| Model | FID ↓ | IS ↑ | Prec ↑ | Rec ↑ | GFLOPs |
|---|---|---|---|---|---|
| RAE-DiT-S (SigLIP2-B) | 28.03 | 66.15 | 0.277 | 0.302 | 12.4 |
| + DSR | 23.93 | 63.86 | 0.498 | 0.470 | 13.7 |
| RAE-DiT-B (SigLIP2-B) | 20.36 | 78.76 | 0.539 | 0.493 | 46.6 |
| + DSR | 9.81 | 110.23 | 0.637 | 0.543 | 51.2 |
| RAE-DiT-XL (SigLIP2-B) | 5.89 | 156.54 | 0.686 | 0.562 | 238.1 |
| + DSR | 4.58 | 165.99 | 0.725 | 0.560 | 262.9 |
All rows are trained for 100k iterations under a matched protocol.
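For the XL model the overhead can be read directly off the table, 262.9 / 238.1 ≈ 1.10, and likewise 51.2 / 46.6 ≈ 1.10 for the B model, i.e. roughly 10% extra compute per forward pass.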

Text-to-image (Scale-RAE)

On Scale-RAE — a SigLIP2-B + MetaQuery T2I pipeline trained on 24.7M FLUX.1-schnell synthetic images — DSR also delivers a clean improvement on both GenEval and DPG-Bench.

| Model | GenEval ↑ | DPG-Bench ↑ |
|---|---|---|
| Baseline (Scale-RAE) | 42.6 | 74.3 |
| Ours (Scale-RAE + DSR) | 46.6 | 75.4 |

Qualitative samples (text-to-image)

Same prompt, same seed grid, baseline (left) vs. DSR (right). DSR's improvements tend to show up as cleaner composition, more consistent object identity, and fewer texture artifacts.

"a photo of a cow"
Baseline
baseline samples for: a photo of a cow
DSR (ours)
DSR samples for: a photo of a cow
"a photo of a purple potted plant"
Baseline
baseline samples for: a photo of a purple potted plant
DSR (ours)
DSR samples for: a photo of a purple potted plant
"a photo of an elephant below a surfboard"
Baseline
baseline samples for: a photo of an elephant below a surfboard
DSR (ours)
DSR samples for: a photo of an elephant below a surfboard
"a photo of a white toilet and a red apple"
Baseline
baseline samples for: a photo of a white toilet and a red apple
DSR (ours)
DSR samples for: a photo of a white toilet and a red apple

BibTeX

@article{wu2026dsr,
  title         = {Taming Outlier Tokens in Diffusion Transformers},
  author        = {Wu, Xiaoyu and Wang, Yifei and Fu, Tsu-Jui and Chen, Liang-Chieh and Gan, Zhe and Wei, Chen},
  year          = {2026},
  eprint        = {2605.05206},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.05206},
}