
Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

Yifei Wang1,5 Weimin Bai1,2,3 Colin Zhang4 Debing Zhang4 Weijian Luo4† He Sun1,2,3
1College of Future Technology, Peking University   2National Biomedical Imaging Center, PKU   3Academy for Advanced Interdisciplinary Studies, PKU
4hi-lab, Xiaohongshu Inc   5Yuanpei College, PKU
†Corresponding author

TL;DR

One-step diffusion distillation has split into two camps: KL-based methods (Diff-Instruct, DMD, f-distill) and score-based methods (SiD, SIM, SiDA). Uni-Instruct shows these are all special cases of a single object — the diffusion expansion of the f-divergence family — and turns it into a tractable training loss.

New SoTA one-step FID: 1.46 on CIFAR10 unconditional, 1.38 on CIFAR10 conditional, and 1.02 on ImageNet 64×64 — beating the 79-step EDM teacher (2.35) by a wide margin.
[Figure: Uni-Instruct conceptual overview (left) · one-step FID comparison on ImageNet 64×64 (right)]

Left: Uni-Instruct unifies >10 distillation methods under one f-divergence framework. Right: One-step FID on ImageNet 64×64 — Uni-Instruct (FKL, long training) reaches 1.02 at NFE = 1.

Two camps, one framework

Most one-step diffusion distillation methods fall into one of two camps:

  • KL-based methods (Diff-Instruct, DMD, f-distill), which minimize an integral KL and update the student with Grad(DI);
  • score-based methods (SiD, SIM, SiDA), which match scores along the diffusion trajectory and update the student with Grad(SIM).

The natural question: can we unify them and get the best of both? Uni-Instruct answers yes, by deriving a diffusion expansion of the f-divergence family. The expansion gives a continuous family of losses whose gradients decompose into a weighted combination of Grad(DI) (the KL branch) and Grad(SIM) (the score branch). Existing methods sit at specific corners of this family.

The full picture: every row below is a prior method, and the third column shows the divergence inside Uni-Instruct that recovers it.

| Method | Loss | Div. in UI | Task | Gradient |
|---|---|---|---|---|
| Diff-Instruct (DI) | IKL | χ² | One-step diffusion | Grad(DI) |
| DI++ | IKL + reward | χ² | Human-aligned diffusion | Grad(DI) + ∇θ reward |
| DI* | KL + reward | RKL | Human-aligned diffusion | Grad(SIM) + ∇θ reward |
| SDS | IKL | χ² | Text-to-3D | Grad(DI) |
| DDS | IKL | χ² | Image editing | Grad(DI) |
| VSD | IKL | χ² | Text-to-3D | Grad(DI) |
| DMD | IKL + regression | χ² | One-step diffusion | Grad(DI) + ∇θ MSE |
| RedDiff | IKL + data fidelity | χ² | Inverse problems | Grad(DI) + ∇θ MSE |
| DMD2 | IKL + GAN | χ² | One-step diffusion | Grad(DI) + ∇θ adv |
| SwiftBrush | IKL | χ² | One-step diffusion | Grad(DI) |
| SIM | general KL | RKL | One-step diffusion | Grad(SIM) |
| SiD | KL | RKL | One-step diffusion | Grad(SIM) |
| SiDA | KL + GAN | RKL | One-step diffusion | Grad(SIM) + ∇θ adv |
| SiD-LSG | KL | RKL | One-step diffusion | Grad(SIM) |
| f-distill | I-f + GAN | χ² | One-step diffusion | λf·Grad(DI) + ∇θ adv |
| Uni-Instruct (ours) | f-div. + GAN | all | all of the above | λf^DI·Grad(DI) + λf^SIM(x)·Grad(SIM) + ∇θ adv |
Reproduced from paper Table 1. "Div. in UI" = the f-divergence inside Uni-Instruct that recovers each prior method; IKL = integral KL (the KL expanded along the forward-diffusion trajectory).

What's actually new

1. Diffusion expansion of f-divergences

We start from the standard f-divergence Df(q‖p) and consider the diffusion expansion that matches not just the data distribution but the entire forward-noise marginals (qt, pθ,t) for t ∈ [0, T]. This expansion is what gives one-step distillation its supervisory signal — but the resulting objective is intractable for general f.
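Schematically, writing w(t) for a positive time weighting (the same w(t) that appears in the formulas further down) and qt, pθ,t for the teacher and one-step-student marginals after t units of forward noising, the expanded objective is

∫₀ᵀ w(t) Df(qt ‖ pθ,t) dt,

and choosing the f that generates KL, reverse KL, χ², JS, or JKL selects the corresponding member of the family.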

2. Tractable equivalent loss via gradient-equivalence theorems

We prove that the gradient of the expanded f-divergence is gradient-equivalent to a closed-form expression involving the teacher score and the student's implicit score. Concretely, for any valid f, the gradient decomposes into two terms that we already know how to compute (Grad(DI) and Grad(SIM)), with f-specific weighting functions λf. This unblocks training for the whole f-divergence family — including Forward-KL, Reverse-KL, Jensen-KL (JKL), χ², JS — without needing density ratios.
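In the notation of the unification table above, the equivalence reads (schematically)

∇θ ∫₀ᵀ w(t) Df(qt ‖ pθ,t) dt  ≃  ∫₀ᵀ w(t) 𝔼 [ λf^DI · Grad(DI) + λf^SIM(xt) · Grad(SIM) ] dt,

where ≃ denotes gradient equivalence with respect to θ and λf^DI, λf^SIM are the f-specific weights from the last row of the table.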

3. Flexible recipes: RKL → FKL warm-up

The framework makes it trivial to swap or combine divergences. We find a particularly clean two-stage recipe: train with RKL until convergence, then continue with FKL. RKL gives strong mode-seeking convergence; FKL follows up with mode-covering to fix tail behavior. This is what produces our best ImageNet 64×64 number.
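A minimal sketch of the schedule (the helper name distill_step and the step counts are illustrative placeholders, not the released training code; only the two-stage RKL-then-FKL structure comes from the recipe above):

```python
def distill_step(step_counts, divergence):
    """Placeholder for one Uni-Instruct generator/score update under the chosen f-divergence."""
    step_counts[divergence] = step_counts.get(divergence, 0) + 1
    return step_counts

def rkl_then_fkl(step_counts, rkl_steps, fkl_steps):
    # Stage 1: mode-seeking RKL until (approximate) convergence;
    # a fixed step budget stands in for a real convergence check here.
    for _ in range(rkl_steps):
        step_counts = distill_step(step_counts, "RKL")
    # Stage 2: resume the same student weights and continue with mode-covering FKL.
    for _ in range(fkl_steps):
        step_counts = distill_step(step_counts, "FKL")
    return step_counts

print(rkl_then_fkl({}, rkl_steps=1000, fkl_steps=500))  # {'RKL': 1000, 'FKL': 500}
```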

Why this matters: unifying prior work across four applications

The same Uni-Instruct framework recovers existing methods from four different application areas just by picking a divergence and adding (or omitting) an auxiliary loss. Below is a guided tour of how each prior method drops out as a special case — adapted from Appendix D of the paper.

1. One-step diffusion distillation

Picking χ² kills the SIM weighting and leaves pure Grad(DI). Picking RKL does the opposite. Two camps fall out cleanly:

  • χ² branch · Grad(DI): Diff-Instruct, DMD, DMD2, SwiftBrush, f-distill
  • RKL branch · Grad(SIM): SiD, SIM, SiDA, SiD-LSG
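In the weighting language of the unification table above, these two corners are λf^SIM ≡ 0 (the χ² case, pure Grad(DI)) and λf^DI ≡ 0 (the RKL case, pure Grad(SIM)).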
2. Text-to-3D (NeRF distillation)

DreamFusion's SDS and ProlificDreamer's VSD both minimize an integral KL between a NeRF-rendered distribution and a frozen text-to-image diffusion model. With the right weighting w(t), that's exactly the χ² (Grad(DI)) instance of Uni-Instruct.

∫ w(t) DKL(pθ,t(x|c,y) ‖ qt(x|yc)) dt  ⇒  Uni-Instruct (χ²)

Swapping in FKL (instead of the implicit χ²) gives the 3D-vase results below.

3. Solving inverse problems

RedDiff approximates p(x|y) by a variational q(x), expands the KL along the diffusion trajectory, and ends up with a score-matching loss:

∫ w(t) DKL(pθ,t(xt|y) ‖ qt) dt  =  ∫ ½ g²(t) W(t) 𝔼 ‖ s_pθ,t − s_qt ‖² dt

So RedDiff = Uni-Instruct (RKL) + a tractable data-fidelity term.
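For a typical inverse problem with measurements y = A(x) + noise, that data-fidelity term is a measurement-consistency loss, e.g. ‖y − A(x̂)‖² for the estimated clean image x̂; the exact operator A and weighting depend on the task, so this form is illustrative rather than RedDiff's precise objective.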

4. Human-preference alignment

RLHF for diffusion adds a KL regulariser to a reference distribution plus a reward. Two recent instantiations sit at different f's:

  • DI++ (integral KL)  ⇒  χ² Uni-Instruct + reward
  • DI* (score-based KL)  ⇒  RKL Uni-Instruct + reward

Same template: pick f, optionally add reward / GAN / regression.
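Written out, the template is one composite objective (the coefficients α, β, γ below are illustrative, not values from the paper):

L(θ) = ∫₀ᵀ w(t) Df(qt ‖ pθ,t) dt − α · 𝔼[reward] + β · L_GAN + γ · L_regression,

with each auxiliary term switched on or off according to the Gradient column of the unification table.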

Results

CIFAR-10 (unconditional)

| Family | Model | NFE | FID ↓ |
|---|---|---|---|
| Teacher | VP-EDM | 35 | 1.97 |
| Consistency | sCT / ECT / iCT | 2 | 2.06 / 2.11 / 2.46 |
| One-step | Diff-Instruct | 1 | 4.53 |
| One-step | DMD | 1 | 3.77 |
| One-step | SiD | 1 | 1.92 |
| One-step | SiDA | 1 | 1.52 |
| One-step | SiD²A | 1 | 1.50 |
| One-step | Uni-Instruct (RKL, F.S.) | 1 | 1.52 |
| One-step | Uni-Instruct (FKL, F.S.) | 1 | 1.52 |
| One-step | Uni-Instruct (FKL, L.T.) | 1 | 1.48 |
| One-step | Uni-Instruct (JKL, F.S.) | 1 | 1.46 |
F.S. = from scratch · L.T. = resumed and longer training.

CIFAR-10 (class-conditional)

| Family | Model | NFE | FID ↓ |
|---|---|---|---|
| Teacher | VP-EDM | 35 | 1.79 |
| One-step | Diff-Instruct | 1 | 4.19 |
| One-step | SIM | 1 | 1.96 |
| One-step | SiD | 1 | 1.71 |
| One-step | SiDA / SiD²A | 1 | 1.44 / 1.40 |
| One-step | f-distill | 1 | 1.92 |
| One-step | Uni-Instruct (RKL, F.S.) | 1 | 1.44 |
| One-step | Uni-Instruct (JKL, F.S.) | 1 | 1.42 |
| One-step | Uni-Instruct (FKL, F.S.) | 1 | 1.43 |
| One-step | Uni-Instruct (FKL, L.T.) | 1 | 1.38 |

ImageNet 64×64 (class-conditional)

| Family | Model | NFE | FID ↓ |
|---|---|---|---|
| Teacher | VP-EDM | 511 | 1.36 |
| Diffusion | ADM | 250 | 2.07 |
| Diffusion | DiT-L/2 | 250 | 2.91 |
| Consistency | ECT | 1 | 2.49 |
| Few-step | DMD2 (longer) | 1 | 1.28 |
| Few-step | SiD | 1 | 1.71 |
| Few-step | SiDA / SiD²A | 1 | 1.35 / 1.10 |
| Few-step | f-distill | 1 | 1.16 |
| Few-step | Uni-Instruct (RKL, F.S.) | 1 | 1.35 |
| Few-step | Uni-Instruct (JKL, F.S.) | 1 | 1.28 |
| Few-step | Uni-Instruct (FKL, F.S.) | 1 | 1.34 |
| Few-step | Uni-Instruct (FKL, L.T.) | 1 | 1.02 |

Ablation: divergence × GAN × init

Switching divergence inside the same framework changes FID by >3×. JKL is the best stand-alone choice; GAN loss helps every divergence; warm-starting from a converged RKL run further improves all variants.

| Divergence | SiD init. | GAN | FID ↓ |
|---|---|---|---|
| none (GAN only) | | | 8.21 |
| χ² | | | 4.37 |
| JS | | | 5.23 |
| JKL | | | 1.46 |
| RKL | | | 1.92 |
| FKL | | | 1.88 |
| RKL | | | 1.52 |
| FKL | | | 1.52 |
| RKL | | | 1.50 |
| FKL | | | 1.48 |
| JKL | | | 1.50 |

Bonus: text-to-3D

Because Uni-Instruct subsumes SDS/VSD, it plugs straight into a ProlificDreamer-style text-to-3D pipeline. Adding a discriminator head on the U-Net encoder and distilling for ~400 epochs with FKL gives 3D vases with visibly more diversity and crisper surface detail than ProlificDreamer.

"A refined vase with artistic patterns."
ProlificDreamer (baseline)
Uni-Instruct (FKL)

BibTeX

@inproceedings{wang2025uniinstruct,
  title         = {Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction},
  author        = {Wang, Yifei and Bai, Weimin and Zhang, Colin and Zhang, Debing and Luo, Weijian and Sun, He},
  booktitle     = {Advances in Neural Information Processing Systems (NeurIPS)},
  year          = {2025},
  eprint        = {2505.20755},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2505.20755},
}