Set Diffusion

Core Idea

Diffusion language models can generate many tokens in parallel, but standard masked diffusion is usually fixed-length and cannot reuse a KV cache. Block diffusion restores variable-length, cacheable generation by decoding contiguous blocks left-to-right. Set diffusion keeps the autoregressive factorization, but replaces fixed blocks with flexible-position, flexible-length token sets.

Why Blocks Are Not Enough

Block diffusion models a sequence as a left-to-right sequence of contiguous blocks. It is already a useful bridge between autoregressive and diffusion language models: tokens inside a block can be denoised in parallel, and completed blocks can be cached.

The limitation is the fixed block structure. A model cannot naturally insert tokens at arbitrary positions or choose differently sized groups of tokens, and it cannot update the cache until the whole current block has been decoded. That restricts infilling, arbitrary-position editing, and parallel sampling.

Set Diffusion

Set diffusion models a sequence by marginalizing over generation orders of position sets. A position set \( \sigma_n \) is a nonempty subset of token positions; the sets are disjoint and together cover the whole sequence. At step \( n \), an order policy selects the next positions, and a diffusion model predicts their token values conditioned on earlier revealed sets:

\[ p_\theta(x) = \sum_\sigma \prod_{n=1}^{N} \pi(\sigma_n \mid x_{<\sigma_n}) p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}). \]

This factorization contains several familiar cases. Single left-to-right tokens recover autoregression. Uniform random positions recover order-agnostic masked diffusion. Fixed contiguous left-to-right sets recover block diffusion. The useful middle ground is everything in between.
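As a rough illustration of these special cases, the sketch below builds the sequence of position sets for each one and checks that it is a valid ordering (nonempty, disjoint, covering). The function names are hypothetical, and the order-agnostic case is simplified here to singleton sets in a uniformly random order.

```python
import random

def is_valid_ordering(sets, length):
    """Check that the sets are nonempty, disjoint, and cover positions 0..length-1."""
    seen = [p for s in sets for p in s]
    return all(len(s) > 0 for s in sets) and sorted(seen) == list(range(length))

def autoregressive_order(length):
    # Singleton sets in left-to-right order.
    return [[i] for i in range(length)]

def block_order(length, block_size):
    # Fixed contiguous sets in left-to-right order (the last block may be shorter).
    return [list(range(s, min(s + block_size, length)))
            for s in range(0, length, block_size)]

def random_order(length, rng):
    # Assumption: singleton sets in a uniformly random order.
    perm = list(range(length))
    rng.shuffle(perm)
    return [[i] for i in perm]

def revealed_context(sets, n):
    """Positions revealed before step n, i.e. the index set behind x_{<sigma_n}."""
    return sorted(p for s in sets[:n] for p in s)

# A "middle ground" ordering: non-contiguous sets of different sizes.
flexible = [[0, 5], [1, 2, 3], [4]]
print(is_valid_ordering(flexible, 6), revealed_context(flexible, 2))
```

Any list of sets that passes `is_valid_ordering` is a candidate \( \sigma \) in the sum; the special cases differ only in which of these orderings the policy \( \pi \) puts mass on.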

Flexible positions

Sets can contain arbitrary positions, enabling insertion and infilling rather than only next-block decoding.

Flexible sizes

Larger sets expose more parallelism; smaller sets preserve a stronger autoregressive bias.

Cacheable steps

After a set is accepted, its keys and values can be committed before the next inference step.

Interpolating Token Orderings

Instead of parameterizing the ordering distribution \( \pi(\sigma) \) directly, set diffusion samples reveal times. Let \( \tau \in [0,1] \) denote normalized ordering time. Each position \( \ell \) has a monotone schedule \( \alpha^\ell_\tau \), where \( \alpha^\ell_\tau = \Pr(R_\ell \le \tau) \) is the probability that position \( \ell \) has been revealed by time \( \tau \). Sorting the sampled reveal times \( R_\ell \) produces a generation order; ties form token sets.
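The reveal-time construction can be sketched in a few lines. This is a minimal illustration, assuming a uniform schedule \( \alpha^\ell_\tau = \tau \) for every position and a discretized time grid so that ties actually occur; `sample_reveal_times`, `order_from_reveal_times`, and `num_steps` are hypothetical names.

```python
import random
from itertools import groupby

def sample_reveal_times(length, num_steps, rng):
    """Draw one reveal time per position on a grid of num_steps values in (0, 1]."""
    # Assumption: uniform schedule alpha_tau = tau, discretized so ties can occur.
    return [rng.randint(1, num_steps) / num_steps for _ in range(length)]

def order_from_reveal_times(reveal_times):
    """Sort positions by reveal time; positions with equal times form one set."""
    by_time = sorted(range(len(reveal_times)),
                     key=lambda pos: (reveal_times[pos], pos))
    return [list(group)
            for _, group in groupby(by_time, key=lambda pos: reveal_times[pos])]

rng = random.Random(0)
times = sample_reveal_times(length=8, num_steps=4, rng=rng)
print(order_from_reveal_times(times))
```

With a coarse grid (small `num_steps`) many positions tie and the sets are large; with a fine grid ties become rare and the order approaches a sequence of singletons.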

Position-offset schedules shift each position's active generation interval along the sequence. The interval width \( w \) controls the ordering bias: as \( w \) approaches \( 1/L \), the order concentrates on left-to-right generation; as \( w \) grows, more positions are eligible together, increasing parallelism and moving toward order-agnostic diffusion.
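One way to picture the effect of \( w \) is the sketch below. It assumes a linear ramp over each active interval and evenly spaced offsets chosen so the last interval ends at \( \tau = 1 \); both choices are illustrative assumptions rather than the schedule specified by the source, and the function names are hypothetical.

```python
import random

def active_interval(pos, length, width):
    """Return (start, end) of a position's active interval in ordering time."""
    # Assumption: offsets are spread evenly so the last interval ends at tau = 1.
    start = 0.0 if length == 1 else pos * (1.0 - width) / (length - 1)
    return start, start + width

def alpha(tau, pos, length, width):
    """Pr(R_pos <= tau) under a linear ramp over the active interval."""
    start, _ = active_interval(pos, length, width)
    return min(1.0, max(0.0, (tau - start) / width))

def sample_order(length, width, rng):
    """Sample reveal times by inverting the linear ramp, then sort positions."""
    times = []
    for pos in range(length):
        start, _ = active_interval(pos, length, width)
        times.append(start + width * rng.random())
    return sorted(range(length), key=lambda pos: times[pos])

rng = random.Random(0)
print(sample_order(8, 1 / 8, rng))  # intervals tile [0, 1]: left-to-right order
print(sample_order(8, 1.0, rng))    # all intervals coincide: order-agnostic
```

Under these assumptions, \( w = 1/L \) makes the intervals tile \( [0,1] \) without overlap, so the sampled order is left-to-right, while \( w = 1 \) gives every position the same interval and the order is a uniformly random permutation.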

The decoding width controls how strongly the order favors left-to-right generation.

Sliding-Window SetDLM

SW-SetDLM is a practical instantiation of set diffusion. It combines a position-offset ordering distribution, a factorized token-set likelihood, and a set-causal transformer. At inference time, a sliding output window selects positions whose active intervals contain the current ordering time; those candidates are denoised in parallel, accepted tokens are committed, and their keys and values are appended to the cache before the next step.
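The inference loop described above might look roughly like the following toy sketch. It is not the authors' implementation: the evenly spaced offsets, the confidence threshold, the rule that forces acceptance once a position's interval has closed, and the `denoise` callback are all assumptions made for illustration, and the "cache" is just a list of committed (position, token) pairs standing in for keys and values.

```python
def interval_start(pos, length, width):
    # Assumption: evenly spaced offsets, with the last interval ending at tau = 1.
    return 0.0 if length == 1 else pos * (1.0 - width) / (length - 1)

def sliding_window_decode(length, width, num_steps, denoise, threshold=0.5):
    """Decode all positions; return the tokens and the sets committed per step."""
    tokens = [None] * length  # None marks a still-masked position
    cache = []                # stand-in for the KV cache: one committed set per entry
    for step in range(1, num_steps + 1):
        tau = step / num_steps
        # Output window: masked positions whose active interval has opened.
        window = [p for p in range(length)
                  if tokens[p] is None and interval_start(p, length, width) <= tau]
        if not window:
            continue
        proposals = denoise(window, tokens)  # {position: (token, confidence)}
        accepted = []
        for p in window:
            token, confidence = proposals[p]
            # Assumption: force acceptance once the interval closes or time runs out.
            closed = interval_start(p, length, width) + width <= tau
            if confidence >= threshold or closed or step == num_steps:
                accepted.append((p, token))
        for p, token in accepted:
            tokens[p] = token
        if accepted:
            cache.append(accepted)  # commit the whole set before the next step
    return tokens, cache

def toy_denoise(window, tokens):
    # Hypothetical stand-in for the model: confident only next to revealed context.
    return {p: (p * 10, 1.0 if p == 0 or tokens[p - 1] is not None else 0.0)
            for p in window}

tokens, cache = sliding_window_decode(4, 0.5, 4, toy_denoise)
print(tokens, [[p for p, _ in s] for s in cache])
```

All proposals in a step are computed from the same revealed context, so every token accepted in that step is decoded in parallel; a more confident denoiser commits larger sets per step.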

For low-variance, token-efficient training, SW-SetDLM specializes to singleton token sets. It samples a full ordering, then computes all \( L \) conditional token likelihoods in one causal forward pass, giving an upper bound on the negative log-likelihood:

\[ -\log p_\theta(x) \le -\mathbb{E}_{\sigma \sim \pi} \sum_{n=1}^{L} \log p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}). \]

This teaches the model the position-biased orders used at inference, while keeping the training objective close to a standard causal language-modeling pass. At inference, the same model can decode flexible-position sets instead of being locked to fixed block boundaries.
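The training bound can be recovered from the set-diffusion factorization in one step. With singleton sets (\( N = L \)), and writing \( \pi(\sigma \mid x) = \prod_{n=1}^{L} \pi(\sigma_n \mid x_{<\sigma_n}) \) as shorthand introduced here for the probability of a full ordering, the likelihood is an expectation over orderings:

\[ p_\theta(x) = \sum_\sigma \pi(\sigma \mid x) \prod_{n=1}^{L} p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}) = \mathbb{E}_{\sigma \sim \pi} \prod_{n=1}^{L} p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}). \]

Because \( -\log \) is convex, Jensen's inequality moves the logarithm inside the expectation:

\[ -\log p_\theta(x) = -\log \mathbb{E}_{\sigma \sim \pi} \prod_{n=1}^{L} p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}) \le -\mathbb{E}_{\sigma \sim \pi} \sum_{n=1}^{L} \log p_\theta(x_{\sigma_n} \mid x_{<\sigma_n}). \]

A single sampled ordering therefore gives an unbiased estimate of the right-hand side.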

Results

On GSM8K, SW-SetDLM improves the diffusion Pareto frontier relative to block diffusion, with higher accuracy and higher throughput at comparable prediction budgets.

Figure: GSM8K speed-accuracy tradeoff, with SW-SetDLM above block diffusion.

Model	PPL (↓)	0-shot pass@1 (↑)	Tput (↑)
AR Transformer	1.25	75.74	67.16
MDLM	≤ 2.10	6.37	≥ 24.48
BD3LM, S = 4	≤ 1.41	63.53	≥ 55.39
SW-SetDLM, S ≤ 8	≤ 1.42	66.41	≥ 60.42

Infilling

Infilling directly tests flexible-position decoding. On 1,871 five-sentence ROCStories examples, the task masks one or three middle sentences and fills the gap from the remaining context. Against BD3LM, SW-SetDLM raises one-sentence ROUGE-L from 8.6 to 10.9 while increasing throughput from ≥ 105.8 to ≥ 132.0 tokens/sec; for three-sentence infilling, ROUGE-L rises from 11.1 to 13.2 with throughput ≥ 123.1 versus ≥ 114.2. MDLM reaches higher ROUGE, but is substantially slower because it lacks KV caching and recomputes the full 1024-token context at every denoising step.

Model	Size	Tokens	Infill 1/5 R-1 / 2 / L (↑)	1/5 Tput (↑)	Infill 3/5 R-1 / 2 / L (↑)	3/5 Tput (↑)
Autoregression
GPT2-S†	127M	n/a	9.5 / 0.4 / 8.7	-	13.5 / 0.6 / 10.2	-
AR Transformer	130M	157B	8.2 / 0.5 / 7.6	159.7 ± 1.8	18.3 / 1.3 / 13.1	158.4 ± 0.8
Diffusion
SEDD-S†	170M	210B	11.6 / 0.8 / 10.7	-	16.2 / 1.3 / 12.2	-
MDLM	130M	157B	14.5 / 1.6 / 13.3	≥ 71.9 ± 2.9	22.2 / 2.3 / 15.2	≥ 72.6 ± 2.7
DiffuGPT-S†	127M	n/a	14.0 / 1.5 / 13.0	-	16.4 / 2.0 / 14.2	-
ASSD†	110M	45B	13.1 / 1.1 / 12.0	-	18.0 / 1.4 / 13.2	-
AR+Diffusion Hybrid
BD3LM, S = 16	110M	157B	9.2 / 0.6 / 8.6	≥ 105.8 ± 4.5	15.8 / 0.8 / 11.1	≥ 114.2 ± 1.3
SW-SetDLM, S ≤ 32	110M	157B	11.6 / 1.0 / 10.9	≥ 132.0 ± 1.8	18.1 / 1.3 / 13.2	≥ 123.1 ± 2.6

ROCStories infilling results. "Infill k/5" fills k middle sentences given the rest; diffusion throughputs are lower bounds because sampling uses the maximum number of denoising steps.

Infilling

Flexible-position sets improve ROCStories infilling over block diffusion while decoding faster.

Summarization

On CNN/DailyMail, SW-SetDLM keeps competitive ROUGE and decodes up to 10% faster than block diffusion.

Unconditional Generation

On unconditional generation, SW-SetDLM improves MAUVE over block diffusion while decoding faster.

BibTeX

@inproceedings{arriola2026setdiffusion,
  title = {Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding},
  author = {Arriola, Marianne and Kuleshov, Volodymyr},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}
