Core Idea
Diffusion language models can generate many tokens in parallel, but standard masked diffusion is usually fixed-length and cannot reuse a KV cache. Block diffusion restores variable-length, cacheable generation by decoding contiguous blocks left-to-right. Set diffusion keeps the autoregressive factorization, but replaces fixed blocks with flexible-position, flexible-length token sets.
Why Blocks Are Not Enough
Block diffusion models a sequence as a left-to-right sequence of contiguous blocks. It is already a useful bridge between autoregressive and diffusion language models: tokens inside a block can be denoised in parallel, and completed blocks can be cached.
The limitation is the fixed block structure. A model cannot naturally insert tokens at arbitrary positions or choose differently sized groups of tokens, and it cannot update the cache until the whole current block has been decoded. That restricts infilling, arbitrary-position editing, and parallel sampling.
Set Diffusion
Set diffusion models a sequence by marginalizing over generation orders of position sets. A position set \( \sigma_n \) is a nonempty subset of token positions; the sets are disjoint and together cover the whole sequence. At step \( n \), an order policy selects the next positions, and a diffusion model predicts their token values conditioned on earlier revealed sets.
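One way to write the factorization described here is sketched below. The notation is an assumption for illustration, not necessarily the paper's exact statement: \( N \) is the number of sets, \( x^{\sigma_n} \) denotes the tokens at the positions in \( \sigma_n \), \( x^{\sigma_{<n}} \) denotes the tokens in all earlier sets, \( \pi(\sigma) \) is the ordering distribution, and the sum runs over ordered partitions \( \sigma = (\sigma_1, \dots, \sigma_N) \) of the positions.

\[
p_\theta(x) \;=\; \sum_{\sigma} \pi(\sigma) \prod_{n=1}^{N} p_\theta\!\left(x^{\sigma_n} \,\middle|\, x^{\sigma_{<n}}\right)
\]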
This factorization contains several familiar cases. Single left-to-right tokens recover autoregression. Uniform random positions recover order-agnostic masked diffusion. Fixed contiguous left-to-right sets recover block diffusion. The useful middle ground is everything in between.
Sets can contain arbitrary positions, enabling insertion and infilling rather than only next-block decoding.
Larger sets expose more parallelism; smaller sets preserve a stronger autoregressive bias.
After a set is accepted, its keys and values can be committed before the next inference step.
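The special cases above can be made concrete by writing each generation order as a list of position sets. The following is a minimal sketch; the helper names (`is_valid_set_order`, `block_order`, and so on) are hypothetical and chosen only for illustration, and the random order reveals one position at a time for simplicity.

```python
import random

def is_valid_set_order(sets, length):
    """Check that the sets are nonempty, disjoint, and cover positions 0..length-1."""
    seen = [p for s in sets for p in s]
    return all(len(s) > 0 for s in sets) and sorted(seen) == list(range(length))

def autoregressive_order(length):
    # One singleton set per position, left to right.
    return [[i] for i in range(length)]

def block_order(length, block_size):
    # Fixed contiguous blocks, left to right (the last block may be shorter).
    return [list(range(s, min(s + block_size, length)))
            for s in range(0, length, block_size)]

def random_order(length, rng):
    # Uniformly random positions, revealed one at a time for simplicity.
    perm = list(range(length))
    rng.shuffle(perm)
    return [[i] for i in perm]

length = 8
orders = {
    "autoregressive": autoregressive_order(length),
    "block (size 4)": block_order(length, 4),
    "random": random_order(length, random.Random(0)),
    # A flexible set order: non-contiguous positions and unequal set sizes.
    "flexible": [[0, 1], [2, 5], [3], [4, 6, 7]],
}
for name, sets in orders.items():
    print(name, sets, is_valid_set_order(sets, length))
```

Every entry passes the same validity check, which is the sense in which the three familiar cases and the flexible middle ground share one factorization.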
Interpolating Token Orderings
Instead of parameterizing the ordering distribution \( \pi(\sigma) \) directly, set diffusion samples reveal times. Let \( \tau \in [0,1] \) denote normalized ordering time. Each position \( \ell \) has a monotone schedule \( \alpha^\ell_\tau \), where \( \alpha^\ell_\tau = \Pr(R_\ell \le \tau) \) is the probability that position \( \ell \) has been revealed by time \( \tau \). Sorting the sampled reveal times \( R_\ell \) produces a generation order; ties form token sets.
Position-offset schedules shift each position's active generation interval along the sequence. The interval width \( w \) controls the ordering bias: as \( w \) approaches \( 1/L \), the order concentrates on left-to-right generation; as \( w \) grows, more positions are eligible together, increasing parallelism and moving toward order-agnostic diffusion.
The decoding width controls how strongly the order favors left-to-right generation.
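The reveal-time construction can be sketched in a few lines. The schedule below is an assumption made for illustration: each position \( \ell \) gets a linear ramp \( \alpha^\ell_\tau = \mathrm{clip}((\tau - o_\ell)/w, 0, 1) \) with evenly spaced offsets \( o_\ell \), so its reveal time is uniform on \( [o_\ell, o_\ell + w] \). The function names and the time discretization used to create ties are likewise hypothetical.

```python
import numpy as np

def sample_reveal_times(length, width, rng):
    """Sample one reveal time per position under an assumed linear schedule.

    Position l is active on [offset_l, offset_l + width] and its reveal time
    is uniform on that interval (inverse-CDF sampling of the linear ramp).
    """
    offsets = np.arange(length) * (1.0 - width) / max(length - 1, 1)
    return offsets + width * rng.random(length)

def sets_from_reveal_times(reveal_times, num_steps):
    """Discretize time into num_steps bins; positions sharing a bin tie and form one set."""
    bins = np.minimum((reveal_times * num_steps).astype(int), num_steps - 1)
    return [np.flatnonzero(bins == b).tolist() for b in np.unique(bins)]

rng = np.random.default_rng(0)
length = 8

# Narrow width (1/L): active intervals do not overlap, so the order is left to right.
narrow = sample_reveal_times(length, 1.0 / length, rng)
print(np.argsort(narrow).tolist())

# Full width (1): every position is active throughout, so the order is uniformly random.
wide = sample_reveal_times(length, 1.0, rng)
print(np.argsort(wide).tolist())
print(sets_from_reveal_times(wide, num_steps=4))
```

Intermediate widths give orders that drift left to right on average while still allowing nearby positions to be revealed together or out of order.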
Sliding-Window SetDLM
SW-SetDLM is a practical instantiation of set diffusion. It combines a position-offset ordering distribution, a factorized token-set likelihood, and a set-causal transformer. At inference time, a sliding output window selects positions whose active intervals contain the current ordering time; those candidates are denoised in parallel, accepted tokens are committed, and their keys and values are appended to the cache before the next step.
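The inference loop described above can be sketched as a toy simulation. Everything specific here is an assumption for illustration: the linear position offsets, the `toy_denoiser` stand-in that returns random tokens and confidences, the confidence `threshold`, and the rule that force-accepts a position whose interval is about to close. The list `cache_order` only records commit order and stands in for appending keys and values to a real KV cache.

```python
import numpy as np

def active_interval(length, width):
    """Start and end of each position's active interval (assumed linear offsets)."""
    starts = np.arange(length) * (1.0 - width) / max(length - 1, 1)
    return starts, starts + width

def toy_denoiser(positions, rng):
    """Stand-in for the model: a random token id and confidence per position."""
    return rng.integers(0, 100, len(positions)), rng.random(len(positions))

def sliding_window_decode(length, width, num_steps, threshold, rng):
    starts, ends = active_interval(length, width)
    tokens = np.full(length, -1)   # -1 marks a still-masked position
    cache_order = []               # stand-in for the KV cache: commit order
    taus = np.linspace(0.0, 1.0, num_steps)
    for step, tau in enumerate(taus):
        masked = tokens < 0
        # Sliding output window: masked positions whose interval contains tau.
        window = np.flatnonzero(masked & (starts <= tau) & (tau <= ends))
        if window.size == 0:
            continue
        proposal, confidence = toy_denoiser(window, rng)
        # Assumption: force-accept positions whose interval closes before the next step.
        next_tau = taus[step + 1] if step + 1 < num_steps else np.inf
        accept = (confidence >= threshold) | (ends[window] < next_tau)
        tokens[window[accept]] = proposal[accept]
        cache_order.extend(window[accept].tolist())  # commit keys/values here
    return tokens, cache_order

tokens, cache_order = sliding_window_decode(
    length=12, width=0.25, num_steps=16, threshold=0.7,
    rng=np.random.default_rng(0),
)
print(cache_order)
```

In this sketch the time step must be no larger than the interval width (here 1/15 versus 0.25), otherwise a position's interval could fall between two steps and never enter the window.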
For low-variance, token-efficient training, SW-SetDLM specializes to singleton token sets. It samples a full ordering, then computes all \( L \) conditional token likelihoods in one causal forward pass.
This teaches the model the position-biased orders used at inference, while keeping the training objective close to a standard causal language-modeling pass. At inference, the same model can decode flexible-position sets instead of being locked to fixed block boundaries.
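A plausible form of the singleton training objective is sketched below. The notation is assumed for illustration rather than taken from the paper: \( \sigma = (\sigma_1, \dots, \sigma_L) \) is a sampled ordering in which each \( \sigma_n \) is a single position, and \( x^{\sigma_{<n}} \) are the tokens revealed before step \( n \).

\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{\sigma \sim \pi}\left[\, \sum_{n=1}^{L} \log p_\theta\!\left(x^{\sigma_n} \,\middle|\, x^{\sigma_{<n}}\right) \right]
\]

Under this reading, each sampled ordering contributes \( L \) next-token terms from a single pass, which is what makes the objective resemble a standard causal language-modeling loss applied to a reordered sequence.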
Results
On GSM8K, SW-SetDLM improves the diffusion Pareto frontier relative to block diffusion, with higher accuracy and higher throughput at comparable prediction budgets.
| Model | Perplexity (↓) | 0-shot pass@1 (↑) | Throughput (↑) |
|---|---|---|---|
| AR Transformer | 1.25 | 75.74 | 67.16 |
| MDLM | ≤ 2.10 | 6.37 | ≥ 24.48 |
| BD3LM, S = 4 | ≤ 1.41 | 63.53 | ≥ 55.39 |
| SW-SetDLM, S ≤ 8 | ≤ 1.42 | 66.41 | ≥ 60.42 |
Infilling
Infilling directly tests flexible-position decoding. On 1,871 five-sentence ROCStories examples, the task masks one or three middle sentences and fills the gap from the remaining context. Against BD3LM, SW-SetDLM raises one-sentence ROUGE-L from 8.6 to 10.9 while increasing throughput from ≥ 105.8 to ≥ 132.0 tokens/sec; for three-sentence infilling, ROUGE-L rises from 11.1 to 13.2 with throughput ≥ 123.1 versus ≥ 114.2. MDLM reaches higher ROUGE, but is substantially slower because it lacks KV caching and recomputes the full 1024-token context at every denoising step.
| Model | Size | Tokens | Infill 1/5 ROUGE-1 / 2 / L (↑) | Infill 1/5 throughput, tokens/sec (↑) | Infill 3/5 ROUGE-1 / 2 / L (↑) | Infill 3/5 throughput, tokens/sec (↑) |
|---|---|---|---|---|---|---|
| Autoregression | ||||||
| GPT2-S† | 127M | n/a | 9.5 / 0.4 / 8.7 | - | 13.5 / 0.6 / 10.2 | - |
| AR Transformer | 130M | 157B | 8.2 / 0.5 / 7.6 | 159.7 ± 1.8 | 18.3 / 1.3 / 13.1 | 158.4 ± 0.8 |
| Diffusion | ||||||
| SEDD-S† | 170M | 210B | 11.6 / 0.8 / 10.7 | - | 16.2 / 1.3 / 12.2 | - |
| MDLM | 130M | 157B | 14.5 / 1.6 / 13.3 | ≥ 71.9 ± 2.9 | 22.2 / 2.3 / 15.2 | ≥ 72.6 ± 2.7 |
| DiffuGPT-S† | 127M | n/a | 14.0 / 1.5 / 13.0 | - | 16.4 / 2.0 / 14.2 | - |
| ASSD† | 110M | 45B | 13.1 / 1.1 / 12.0 | - | 18.0 / 1.4 / 13.2 | - |
| AR+Diffusion Hybrid | ||||||
| BD3LM, S = 16 | 110M | 157B | 9.2 / 0.6 / 8.6 | ≥ 105.8 ± 4.5 | 15.8 / 0.8 / 11.1 | ≥ 114.2 ± 1.3 |
| SW-SetDLM, S ≤ 32 | 110M | 157B | 11.6 / 1.0 / 10.9 | ≥ 132.0 ± 1.8 | 18.1 / 1.3 / 13.2 | ≥ 123.1 ± 2.6 |
ROCStories infilling results. "Infill k/5" fills k middle sentences given the rest; diffusion throughputs are lower bounds because sampling uses the maximum number of denoising steps.
Flexible-position sets improve ROCStories infilling over block diffusion while decoding faster.
On CNN/DailyMail, SW-SetDLM keeps competitive ROUGE and decodes up to 10% faster than block diffusion.
On unconditional generation, SW-SetDLM improves MAUVE over block diffusion while decoding faster.
BibTeX
@inproceedings{arriola2026setdiffusion,
title = {Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding},
author = {Arriola, Marianne and Kuleshov, Volodymyr},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026}
}