Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding

Cornell University
ICML 2026

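Figure: Block diffusion versus set diffusion (ours). Block diffusion uses fixed block partitioning and updates the KV cache only after each block. Set diffusion uses flexible, position-biased decoding and updates the KV cache at every step. The legend ranges from "more likely to decode" to "less likely to decode".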

Core Idea

Diffusion language models can generate many tokens in parallel, but standard masked diffusion is usually fixed-length and cannot reuse a KV cache. Block diffusion restores variable-length, cacheable generation by decoding contiguous blocks left-to-right. Set diffusion keeps the autoregressive factorization, but replaces fixed blocks with flexible-position, flexible-length token sets.

Why Blocks Are Not Enough

Block diffusion models a sequence as a left-to-right sequence of contiguous blocks. It is already a useful bridge between autoregressive and diffusion language models: tokens inside a block can be denoised in parallel, and completed blocks can be cached.

The limitation is the fixed block structure. A model cannot naturally insert tokens at arbitrary positions or choose differently sized groups of tokens, and it cannot update the cache until the whole current block has been decoded. That restricts infilling, arbitrary-position editing, and parallel sampling.

Set Diffusion

Set diffusion models a sequence by marginalizing over generation orders of position sets. A position set \( \sigma_n \) is a nonempty subset of token positions; the sets are disjoint and together cover the whole sequence. At step \( n \), an order policy selects the next positions, and a diffusion model predicts their token values conditioned on earlier revealed sets:

\[ p_\theta(x) = \sum_\sigma \prod_{n=1}^{N} \pi(\sigma_n \mid x_{<\sigma_n}) p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}). \]

This factorization contains several familiar cases. Single left-to-right tokens recover autoregression. Uniform random positions recover order-agnostic masked diffusion. Fixed contiguous left-to-right sets recover block diffusion. The useful middle ground is everything in between.
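The special cases above differ only in how the positions are partitioned into ordered sets. The following is a minimal Python sketch of that idea, not the authors' implementation: the function names (`autoregressive_sets`, `block_sets`, `random_sets`, `is_valid_partition`) and the 0-based position indexing are assumptions made for illustration.

```python
import random

def autoregressive_sets(L):
    """Singleton sets in left-to-right order: the autoregressive case."""
    return [[i] for i in range(L)]

def block_sets(L, block_size):
    """Fixed contiguous left-to-right sets: the block diffusion case."""
    return [list(range(s, min(s + block_size, L)))
            for s in range(0, L, block_size)]

def random_sets(L, sizes, rng):
    """Uniformly random positions, split into sets of the given sizes.
    This mimics an order-agnostic (masked diffusion style) ordering."""
    assert sum(sizes) == L and all(k > 0 for k in sizes)
    perm = list(range(L))
    rng.shuffle(perm)
    sets, start = [], 0
    for k in sizes:
        sets.append(sorted(perm[start:start + k]))
        start += k
    return sets

def is_valid_partition(sets, L):
    """Position sets must be nonempty, disjoint, and cover every position."""
    flat = [p for s in sets for p in s]
    return all(len(s) > 0 for s in sets) and sorted(flat) == list(range(L))
```

Any list of sets that passes `is_valid_partition` is an admissible generation order under the factorization; the three constructors only pick out the familiar corners of that space.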

Flexible positions

Sets can contain arbitrary positions, enabling insertion and infilling rather than only next-block decoding.

Flexible sizes

Larger sets expose more parallelism; smaller sets preserve a stronger autoregressive bias.

Cacheable steps

After a set is accepted, its keys and values can be committed before the next inference step.

Interpolating Token Orderings

Instead of parameterizing the ordering distribution \( \pi(\sigma) \) directly, set diffusion samples reveal times. Let \( \tau \in [0,1] \) denote normalized ordering time. Each position \( \ell \) has a monotone schedule \( \alpha^\ell_\tau \), where \( \alpha^\ell_\tau = \Pr(R_\ell \le \tau) \) is the probability that position \( \ell \) has been revealed by time \( \tau \). Sorting the sampled reveal times \( R_\ell \) produces a generation order; ties form token sets.
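As a rough illustration of how reveal times induce an ordering, the sketch below samples each \( R_\ell \) by inverse-CDF sampling on a discrete time grid and groups tied positions into sets. The grid of `T` steps is an assumption added here so that ties occur with nonzero probability; the helper names and the example schedule `uniform_alpha` are likewise hypothetical.

```python
import random

def sample_reveal_steps(alpha, L, T, rng):
    """For each position l, draw u ~ U(0, 1) and return the first grid step
    t in 1..T with alpha(l, t / T) >= u (inverse-CDF sampling on a grid)."""
    steps = []
    for l in range(L):
        u = rng.random()
        t = next((t for t in range(1, T + 1) if alpha(l, t / T) >= u), T)
        steps.append(t)
    return steps

def group_into_sets(steps):
    """Sort positions by reveal step; positions that tie form one set."""
    by_step = {}
    for pos, t in enumerate(steps):
        by_step.setdefault(t, []).append(pos)
    return [by_step[t] for t in sorted(by_step)]

def uniform_alpha(l, tau):
    """Example schedule (assumption): every position shares alpha = tau,
    so the resulting order ignores position entirely."""
    return tau
```

For example, `group_into_sets(sample_reveal_steps(uniform_alpha, 8, 4, random.Random(0)))` yields at most four sets that together cover positions 0 through 7. A coarser grid produces larger sets, and a finer grid produces mostly singletons.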

Position-offset schedules shift each position's active generation interval along the sequence. The interval width \( w \) controls the ordering bias: as \( w \) approaches \( 1/L \), the order concentrates on left-to-right generation; as \( w \) grows, more positions are eligible together, increasing parallelism and moving toward order-agnostic diffusion.
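To make the role of \( w \) concrete, here is a small sketch of one possible position-offset schedule. The linear ramp and the evenly spaced interval starts are assumptions chosen for illustration, not the schedule from the paper; positions are 0-indexed and `interval_start`, `offset_alpha`, and `sample_order` are hypothetical names.

```python
import random

def interval_start(l, L, w):
    """Assumed offset: starts are spread evenly over [0, 1 - w]."""
    return l * (1.0 - w) / (L - 1) if L > 1 else 0.0

def offset_alpha(l, tau, L, w):
    """Assumed linear-ramp CDF: position l is revealed uniformly at random
    inside its active interval [start_l, start_l + w]."""
    return min(1.0, max(0.0, (tau - interval_start(l, L, w)) / w))

def sample_order(L, w, rng):
    """Sample R_l = start_l + w * u with u ~ U(0, 1), then sort positions
    by reveal time to obtain a generation order."""
    times = [interval_start(l, L, w) + w * rng.random() for l in range(L)]
    return sorted(range(L), key=lambda l: times[l])
```

Under these assumptions, \( w = 1/L \) gives back-to-back intervals \( [\ell/L, (\ell+1)/L] \), so the sampled order is left-to-right, while \( w = 1 \) gives every position the same interval \( [0, 1] \), so the order is a uniformly random permutation.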

Figure: Reveal-time CDFs interpolating from autoregression through Sliding-Window SetDLM to MDLM. The decoding width controls how strongly the order favors left-to-right generation.

Sliding-Window SetDLM

SW-SetDLM is a practical instantiation of set diffusion. It combines a position-offset ordering distribution, a factorized token-set likelihood, and a set-causal transformer. At inference time, a sliding output window selects positions whose active intervals contain the current ordering time; those candidates are denoised in parallel, accepted tokens are committed, and their keys and values are appended to the cache before the next step.
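The inference loop described above can be sketched roughly as follows. This is a toy illustration under several assumptions that are not from the source: the linear interval offsets, the confidence-threshold acceptance rule, the rule that commits a position once its interval has closed, and the stand-in `toy_denoise` model. The `cache` list only records commit order in place of real keys and values. As a simplification, the window here also keeps positions whose interval has already closed but which are still uncommitted.

```python
def interval(l, L, w):
    """Active interval [start, end] of position l (assumed linear offset)."""
    start = l * (1.0 - w) / (L - 1) if L > 1 else 0.0
    return start, start + w

def decode(L, w, num_steps, denoise, threshold=0.5):
    tokens = {}  # position -> committed token
    cache = []   # stand-in for the KV cache: positions in commit order
    for step in range(1, num_steps + 1):
        tau = step / num_steps
        # Sliding window: uncommitted positions whose interval has opened.
        window = [l for l in range(L)
                  if l not in tokens and interval(l, L, w)[0] <= tau]
        if not window:
            continue
        # One parallel call over all candidates: {pos: (token, confidence)}.
        preds = denoise(tokens, window)
        for l in window:
            token, conf = preds[l]
            overdue = interval(l, L, w)[1] <= tau or step == num_steps
            if conf >= threshold or overdue:
                tokens[l] = token
                cache.append(l)  # commit before the next inference step
    return tokens, cache

def toy_denoise(tokens, window):
    """Hypothetical stand-in for the model: confident on even positions."""
    return {l: (f"tok{l}", 0.9 if l % 2 == 0 else 0.3) for l in window}
```

Calling `decode(8, 0.25, 8, toy_denoise)` commits every position exactly once. Confident positions are accepted as soon as their interval opens, and the rest are committed when their interval closes, which is what lets the cache grow at every step rather than once per block.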

For low-variance, token-efficient training, SW-SetDLM specializes to singleton token sets. It samples a full ordering, then computes all \( L \) conditional token likelihoods in one causal forward pass, which gives an upper bound on the negative log-likelihood:

\[ -\log p_\theta(x) \le -\mathbb{E}_{\sigma \sim \pi} \sum_{n=1}^{L} \log p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}). \]

This teaches the model the position-biased orders used at inference, while keeping the training objective close to a standard causal language-modeling pass. At inference, the same model can decode flexible-position sets instead of being locked to fixed block boundaries.
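For readers who want the missing step, the training bound can be recovered from the set diffusion factorization with Jensen's inequality. The derivation below is a sketch for the singleton case \( N = L \), and it assumes that \( \pi(\sigma) = \prod_{n} \pi(\sigma_n \mid x_{<\sigma_n}) \) is a normalized distribution over orderings.

```latex
\begin{align*}
-\log p_\theta(x)
  &= -\log \sum_\sigma \prod_{n=1}^{L}
       \pi(\sigma_n \mid x_{<\sigma_n})\,
       p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}) \\
  &= -\log \mathbb{E}_{\sigma \sim \pi}
       \prod_{n=1}^{L} p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}) \\
  &\le -\mathbb{E}_{\sigma \sim \pi}
       \sum_{n=1}^{L} \log p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}).
\end{align*}
```

The last line uses the convexity of \( -\log \). Its right-hand side is an expectation of ordinary per-token log-likelihoods, so a single sampled ordering gives an unbiased estimate of the bound.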

Results

On GSM8K, SW-SetDLM improves the diffusion Pareto frontier relative to block diffusion, with higher accuracy and higher throughput at comparable prediction budgets.

Figure: GSM8K speed-accuracy tradeoff, with SW-SetDLM above block diffusion.
| Model | PPL (↓) | 0-shot pass@1 (↑) | Tput (↑) |
| --- | --- | --- | --- |
| AR Transformer | 1.25 | 75.74 | 67.16 |
| MDLM | ≤ 2.10 | 6.37 | ≥ 24.48 |
| BD3LM, S = 4 | ≤ 1.41 | 63.53 | ≥ 55.39 |
| SW-SetDLM, S ≤ 8 | ≤ 1.42 | 66.41 | ≥ 60.42 |

Infilling

Infilling directly tests flexible-position decoding. On 1,871 five-sentence ROCStories examples, the task masks one or three middle sentences and fills the gap from the remaining context. Against BD3LM, SW-SetDLM raises one-sentence ROUGE-L from 8.6 to 10.9 while increasing throughput from ≥ 105.8 to ≥ 132.0 tokens/sec; for three-sentence infilling, ROUGE-L rises from 11.1 to 13.2 with throughput ≥ 123.1 versus ≥ 114.2. MDLM reaches higher ROUGE, but is substantially slower because it lacks KV caching and recomputes the full 1024-token context at every denoising step.

| Model | Size | Tokens | Infill 1/5 R-1 / 2 / L (↑) | Infill 1/5 Tput (↑) | Infill 3/5 R-1 / 2 / L (↑) | Infill 3/5 Tput (↑) |
| --- | --- | --- | --- | --- | --- | --- |
| Autoregression | | | | | | |
| GPT2-S† | 127M | n/a | 9.5 / 0.4 / 8.7 | - | 13.5 / 0.6 / 10.2 | - |
| AR Transformer | 130M | 157B | 8.2 / 0.5 / 7.6 | 159.7 ± 1.8 | 18.3 / 1.3 / 13.1 | 158.4 ± 0.8 |
| Diffusion | | | | | | |
| SEDD-S† | 170M | 210B | 11.6 / 0.8 / 10.7 | - | 16.2 / 1.3 / 12.2 | - |
| MDLM | 130M | 157B | 14.5 / 1.6 / 13.3 | ≥ 71.9 ± 2.9 | 22.2 / 2.3 / 15.2 | ≥ 72.6 ± 2.7 |
| DiffuGPT-S† | 127M | n/a | 14.0 / 1.5 / 13.0 | - | 16.4 / 2.0 / 14.2 | - |
| ASSD† | 110M | 45B | 13.1 / 1.1 / 12.0 | - | 18.0 / 1.4 / 13.2 | - |
| AR+Diffusion Hybrid | | | | | | |
| BD3LM, S = 16 | 110M | 157B | 9.2 / 0.6 / 8.6 | ≥ 105.8 ± 4.5 | 15.8 / 0.8 / 11.1 | ≥ 114.2 ± 1.3 |
| SW-SetDLM, S ≤ 32 | 110M | 157B | 11.6 / 1.0 / 10.9 | ≥ 132.0 ± 1.8 | 18.1 / 1.3 / 13.2 | ≥ 123.1 ± 2.6 |

ROCStories infilling results. "Infill k/5" fills k middle sentences given the rest; diffusion throughputs are lower bounds because sampling uses the maximum number of denoising steps.

Infilling

Flexible-position sets improve ROCStories infilling over block diffusion while decoding faster.

Summarization

On CNN/DailyMail, SW-SetDLM keeps competitive ROUGE and decodes up to 10% faster than block diffusion.

Unconditional Generation

On unconditional generation, SW-SetDLM improves MAUVE over block diffusion while decoding faster.

BibTeX

@inproceedings{arriola2026setdiffusion,
  title = {Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding},
  author = {Arriola, Marianne and Kuleshov, Volodymyr},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}