Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Cornell Tech. *Equal contribution; corresponding authors
NeurIPS 2025

Slow, decoder-only diffusion: a large decoder both denoises tokens and caches clean context.
Fast, encoder-decoder diffusion: a small decoder denoises tokens while a large encoder caches clean context.


To generate samples, diffusion models iteratively refine a sequence consisting of both clean and corrupted tokens. Our key insight is that this process consists of: 1) producing useful representations of clean tokens and 2) denoising corrupted tokens.

However, prior diffusion language models jointly perform both tasks within the same decoder-only architecture. Thus, these models must expensively invoke the full network at every denoising step.

We propose an encoder-decoder transformer architecture to separate the computation for these tasks. We use an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence conditioned on the encoder’s representation. This enables faster inference, as we call the lightweight decoder multiple times to iteratively denoise tokens and invoke the encoder only periodically to update its representations.
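As a rough illustration, the sketch below shows this call pattern with hypothetical encoder and decoder callables and a simple confidence-based unmasking rule standing in for the actual denoising schedule; it is a sketch of the idea, not the released implementation. The lightweight decoder runs at every step, while the large encoder is invoked only every few steps to refresh its representations.

import torch

@torch.no_grad()
def sample(encoder, decoder, x, mask_id, num_steps, encoder_every=8):
    """x: (B, L) token ids, with corrupted positions holding `mask_id`."""
    context = encoder(x)                              # expensive: large encoder
    for step in range(num_steps):
        if step > 0 and step % encoder_every == 0:
            context = encoder(x)                      # refresh representations only periodically
        logits = decoder(x, context)                  # cheap: lightweight decoder, every step
        # Simple stand-in for the denoising schedule: commit the most confident
        # masked positions this step, never touching already-clean tokens.
        conf, pred = logits.softmax(-1).max(-1)       # (B, L) confidences and argmax ids
        conf = conf.masked_fill(x != mask_id, float("-inf"))
        remaining = int((x == mask_id).sum(-1).max())
        if remaining == 0:
            break
        k = min(x.size(1), max(1, remaining // (num_steps - step)))
        idx = conf.topk(k, dim=-1).indices
        keep = x.gather(1, idx) == mask_id            # guard against overwriting clean tokens
        x = x.scatter(1, idx, torch.where(keep, pred.gather(1, idx), x.gather(1, idx)))
    return x

In a decoder-only model, the call to decoder above would instead invoke the full network, so every denoising step pays the full cost.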

Our Efficient Encoder-Decoder Diffusion (E2D2) consists of an encoder-decoder transformer architecture complemented with efficient training and sampling algorithms that enable both faster inference and KV caching support.

Efficient Encoder-Decoder Diffusion (E2D2) enables faster generation than decoder-only architectures.

Our encoder-decoder architecture enables faster training of block diffusion models, which partition sequences into blocks to improve generation quality and support KV caching. Block diffusion is widely used for generation with large diffusion language models, even those trained with standard full-sequence diffusion (e.g., LLaDA, Seed Diffusion, MMaDA).

However, decoder-only block diffusion incurs higher training costs, with forward passes that are 2× more expensive than standard diffusion, as both the full clean and noised sequences must be processed in every transformer layer.

Encoder-decoder block diffusion uses the encoder to process the clean sequence and the decoder to process the noised sequence, halving training costs compared to a decoder-only model of equal size. During inference, the decoder generates each block of tokens, then the encoder caches their KVs.
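The source of the training-time saving can be seen in a sketch of a single forward pass, again with hypothetical encoder and decoder modules (a simplified view, not the authors' released code): the large encoder processes only the clean sequence and the small decoder only the noised one, whereas a decoder-only block diffusion model pushes both sequences through every layer of the same network.

import torch.nn.functional as F

def training_step(encoder, decoder, clean_seq, noised_seq):
    """clean_seq, noised_seq: (B, L) token ids; corrupted positions differ from clean_seq."""
    context = encoder(clean_seq)                # large encoder: clean tokens only
    logits = decoder(noised_seq, context)       # small decoder: noised tokens only
    loss_mask = noised_seq != clean_seq         # train only on corrupted positions
    return F.cross_entropy(logits[loss_mask], clean_seq[loss_mask])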

Results

We focus on applying the encoder-decoder architecture to parameterizing block diffusion because it attains superior language modeling performance compared to full-sequence masked diffusion, it enables exact KV caching for faster inference, and even recent diffusion LLMs trained with the full-sequence masked diffusion parameterization rely on block-autoregressive decoding at inference.
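For concreteness, here is a minimal sketch of block-autoregressive decoding with encoder KV caching, assuming hypothetical encoder.init_cache / encoder.append_to_cache interfaces and a simple top-k confidence rule in place of the actual unmasking schedule. Within each block only the small decoder runs per step; the encoder is called once per finished block to extend the cache.

import math
import torch

def _unmask_topk(x, logits, mask_id, k):
    # Commit the k most confident masked positions to their argmax prediction;
    # already-clean tokens are never overwritten.
    conf, pred = logits.softmax(-1).max(-1)
    conf = conf.masked_fill(x != mask_id, float("-inf"))
    idx = conf.topk(k, dim=-1).indices
    keep = x.gather(1, idx) == mask_id
    return x.scatter(1, idx, torch.where(keep, pred.gather(1, idx), x.gather(1, idx)))

@torch.no_grad()
def generate(encoder, decoder, prompt, num_blocks, block_len, steps_per_block, mask_id):
    kv_cache = encoder.init_cache(prompt)               # KVs of the clean prompt
    blocks = []
    k = max(1, math.ceil(block_len / steps_per_block))  # tokens committed per step
    for _ in range(num_blocks):
        x = torch.full((prompt.size(0), block_len), mask_id,
                       dtype=torch.long, device=prompt.device)
        for _ in range(steps_per_block):                # only the small decoder runs here
            logits = decoder(x, kv_cache)
            x = _unmask_topk(x, logits, mask_id, k)
        kv_cache = encoder.append_to_cache(kv_cache, x) # one encoder call per finished block
        blocks.append(x)
    return torch.cat([prompt] + blocks, dim=1)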

By varying the depth of E2D2’s decoder and that of decoder-only block diffusion (BD3LM), we examine the trade-off between performance and throughput. We fine-tune models for mathematical reasoning on the GSM8K dataset and compute 0-shot pass@1 accuracy and decoding throughput. E2D2 extends the Pareto frontier of quality and speed as shown below.

Mapping the Pareto Frontier: Larger models increase accuracy on GSM8K at the cost of slower decoding.

For machine translation, E2D2 matches or outperforms our diffusion baselines while achieving higher throughput. Compared to MDLM, which does not support exact KV caching, E2D2 offers better downstream task performance with ∼3× faster inference. As shown below, E2D2 achieves higher throughput and better task performance than the 16-layer BD3LM. While a smaller 12-layer BD3LM approaches the throughput of E2D2, its BLEU score degrades further.

Model         N      Tput (↑)      BLEU (↑)
AR            32     77.6 ± 0.4    25.2
MDLM          32     60.4 ± 0.8    18.4
BD3LM         12     129.6 ± 0.7   23.3
BD3LM         16     102.4 ± 0.5   24.0
E2D2 (Ours)   28/4   162.0 ± 1.4   24.8
WMT (de-en) test BLEU score and decoding throughput. N denotes the number of layers (encoder/decoder for E2D2).

Similarly, on mathematical reasoning, E2D2 improves both downstream performance and decoding throughput compared to diffusion baselines.

Evaluation on GSM8K test set. N denotes the number of layers (encoder/decoder for E2D2).
Model         N       PPL (↓)   0-shot pass@1 (↑)   Tput (↑)
AR            28      1.49      66.6                94.1 ± 0.5
MDLM          28      ≤ 2.30    14.0                31.9 ± 3.0
BD3LM         21      ≤ 1.87    33.2                86.6 ± 0.5
E2D2 (Ours)   28/14   ≤ 1.80    47.9                102.8 ± 0.6

BibTeX


@inproceedings{
  arriola2025e2d2,
  title={Encoder-Decoder Diffusion Language Models for Efficient Training and Inference},
  author={Marianne Arriola and Yair Schiff and Hao Phung and Aaron Gokaslan and Volodymyr Kuleshov},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://arxiv.org/abs/2510.22852}
}