Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance. Block diffusion sets a new state of the art among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences.
In the language modeling task, we have a sequence of \( L \) tokens \( \mathbf{x} = (\mathbf{x}^1, \dots, \mathbf{x}^L ) \) drawn from the data distribution \( q(\mathbf{x}) \). We aim to fit a model \( p_\theta(\mathbf{x}) \) of \( q \).
Autoregressive models define a factorized distribution of the form:\[ \log p_\theta(\mathbf{x}) = \sum_{\ell=1}^L \log p_\theta(\mathbf{x}^\ell \mid \mathbf{x}^{\lt \ell}) \]
However, the sequential dependencies between tokens mean that AR sampling requires \( L \) sequential steps, which can be slow for long sequences.
Diffusion models overcome this limitation by modeling tokens independently, which admits parallel generation. They fit a model that reverses a forward corruption process \( q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \text{Cat}(\mathbf{x}_t ; Q_t \mathbf{x}_{t-1} ) \) defined by transition matrices \( Q_t \). D3PM (Austin et al.) parameterizes the reverse process as
\[ p_\theta(\mathbf{x}_s | \mathbf{x}_t) = \prod_{\ell=1}^L p_\theta (\mathbf{x}_s^\ell | \mathbf{x}_t) = \prod_{\ell=1}^L \sum_{\mathbf{x}^\ell} q(\mathbf{x}_s^{\ell} | \mathbf{x}_t^\ell, \mathbf{x}^\ell)\, p_\theta(\mathbf{x}^{\ell} | \mathbf{x}_t) \]
where the denoising base model \( p_\theta(\mathbf{x}^\ell | \mathbf{x}_t) \) predicts the clean token \( \mathbf{x}^\ell \) given the noised sequence \( \mathbf{x}_t \). However, the diffusion objective only minimizes a bound on the likelihood. As a result, diffusion models lag in likelihood and sample quality. Furthermore, diffusion models are restricted to generating fixed-length sequences.
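The simple discrete diffusion parameterizations we build on below use the absorbing-state (masking) special case of this corruption process, in which each token is independently replaced by a [MASK] token with probability given by the noise level. Here is a minimal PyTorch sketch of that corruption step; `MASK_ID` and the tensor shapes are illustrative assumptions, not the exact implementation.

```python
import torch

MASK_ID = 50257  # hypothetical id of the absorbing [MASK] token

def mask_forward(x, t):
    """Absorbing-state ("masking") forward process q(x_t | x):
    each token is independently replaced by [MASK] with probability t.

    x: (batch, L) tensor of clean token ids
    t: scalar or broadcastable tensor of masking rates in [0, 1]
    """
    corrupt = torch.rand(x.shape, device=x.device) < t
    return torch.where(corrupt, torch.full_like(x, MASK_ID), x)
```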
We combine the two modeling paradigms to enjoy the better likelihoods and flexible-length generation of autoregressive models, as well as the fast, parallel generation of diffusion models.
We propose a modeling framework that autoregressively models blocks of tokens and performs diffusion within each block. Our likelihood factorizes over \( B \) blocks of length \( L' \) as
\[ \log p_\theta (\mathbf{x}) = \sum_{b=1}^B \log p_\theta (\mathbf{x}^b | \mathbf{x}^{\lt b}) \]
Each \( p_\theta (\mathbf{x}^b | \mathbf{x}^{\lt b}) \) is modeled with a discrete diffusion ELBO over a block of \( L' \) tokens. We obtain a principled learning objective \( \mathcal{L}_\text{BD}(\mathbf{x}, \theta) \) by optimizing the following likelihood bound:
\[ \log p_\theta(\mathbf{x}) \geq \mathcal{L}_\text{BD}(\mathbf{x}, \theta) := \sum_{b=1}^{B} \mathcal{L}_{\text{diffusion}}(\mathbf{x}^b, \mathbf{x}^{\lt b}, \theta) \]
We model the per-block likelihood under a simple discrete diffusion parameterization (Sahoo et al., Shi et al., Ou et al.). Our final objective becomes a sum of weighted cross-entropy terms:\[ \mathcal{L}_\text{BD}(\mathbf{x}, \theta) := - \sum_{b=1}^{B} \mathbb{E}_{t \sim \mathcal{U}[0, 1]} \mathbb{E}_{q} \frac{1}{t} \log p_\theta(\mathbf{x}^b | \mathbf{x}_{t}^b, \mathbf{x}^{\lt b}) \]
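For intuition, here is a minimal PyTorch sketch of one Monte Carlo estimate of \( \mathcal{L}_\text{BD} \) under the masked diffusion parameterization: each block gets its own masking rate \( t \), and the cross-entropy is weighted by \( 1/t \) and counted only on masked positions. The model signature and the per-token normalization are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def bd3_loss(model, x, block_size, mask_id):
    """One Monte Carlo sample of L_BD for a batch of sequences x.

    model(x_t, x) is assumed to return per-token logits, predicting each
    clean block from its noised version x_t^b and the clean prefix x^{<b}
    (see the two-pass training scheme below).
    """
    batch, L = x.shape
    n_blocks = L // block_size

    # sample one masking rate per block, broadcast to every token in the block
    t = torch.rand(batch, n_blocks, device=x.device).clamp_min(1e-3)
    t_tok = t.repeat_interleave(block_size, dim=1)          # (batch, L)

    # absorbing-state corruption: mask each token with probability t
    masked = torch.rand(batch, L, device=x.device) < t_tok
    x_t = torch.where(masked, torch.full_like(x, mask_id), x)

    logits = model(x_t, x)                                   # (batch, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (batch, L)

    # 1/t-weighted loss, counted only on the masked (noised) positions
    return (ce * masked / t_tok).sum() / x.numel()
```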
Naively, we would compute the logits by applying \( \mathbf{x}_\theta^b( \mathbf{x}_t^b, \mathbf{K}^{1:b\text{-}1}, \mathbf{V}^{1:b\text{-}1}) \) in a loop \( B \) times. Instead, we only require two forward passes. The first pass precomputes keys and values \( \mathbf{K}^{1:B}, \mathbf{V}^{1:B} \) for the full clean sequence \( \mathbf{x} \). In the second forward pass, we compute denoised predictions for all blocks simultaneously using \( \mathbf{x}_\theta^b( \mathbf{x}_t^b, \mathbf{K}^{1:b\text{-}1}, \mathbf{V}^{1:b\text{-}1}) \).
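Schematically, the vectorized training step looks as follows; the `return_kv` / `kv_cache` keyword arguments and the block-causal attention mask interface are placeholders for however the underlying transformer exposes its KV cache.

```python
def bd3_two_pass(transformer, x_clean, x_noised, block_causal_mask):
    """Replace the naive B-iteration loop with two forward passes.

    Pass 1: encode the clean sequence once to precompute K^{1:B}, V^{1:B}.
    Pass 2: denoise all blocks at once; the block-causal attention mask lets
    block b attend bidirectionally to itself (noised) and to the cached
    clean keys/values of blocks 1..b-1 only.
    """
    _, kv_clean = transformer(x_clean, attn_mask=block_causal_mask,
                              return_kv=True)
    logits, _ = transformer(x_noised, attn_mask=block_causal_mask,
                            kv_cache=kv_clean)
    return logits
```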
To sample from BD3-LMs, we generate one block at a time, conditioned on previously sampled blocks. After generating a block, we cache its keys and values, as in AR decoding. We may use any diffusion sampling procedure \( \text{SAMPLE} ( \mathbf{x}_\theta^b, \mathbf{K}^{1:b\text{-}1}, \mathbf{V}^{1:b\text{-}1}) \) to sample from the conditional distribution \( p_\theta (\mathbf{x}_s^b | \mathbf{x}_t^b, \mathbf{x}^{ < b}) \) over \( T \) sampling steps per block.
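A sketch of the resulting generation loop is shown below; `sample_block` stands in for any diffusion sampling procedure \( \text{SAMPLE} \) run for \( T \) steps, and the KV-cache helper is a hypothetical interface.

```python
import torch

def bd3_generate(model, sample_block, n_blocks, block_size, T, mask_id):
    """Block-autoregressive generation: denoise one block at a time with a
    diffusion sampler, then cache its keys/values as in AR decoding.

    sample_block(model, x_b, kv_cache, T) is any diffusion sampling
    procedure that returns a clean block after T denoising steps.
    """
    kv_cache = None          # K^{1:b-1}, V^{1:b-1} of already generated blocks
    blocks = []
    for _ in range(n_blocks):
        x_b = torch.full((1, block_size), mask_id)   # start the block fully masked
        x_b = sample_block(model, x_b, kv_cache, T)  # T denoising steps
        kv_cache = model.append_kv(x_b, kv_cache)    # hypothetical KV-cache update
        blocks.append(x_b)
    return torch.cat(blocks, dim=1)
```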
Our block diffusion parameterization is equivalent in expectation to the autoregressive NLL in the limiting case \( L'=1 \). Surprisingly, we find a two-point perplexity gap between our block diffusion model with \( L'=1 \) and AR when training both models on the LM1B dataset. We identify the high training variance of the diffusion objective as responsible for this perplexity gap.
Intuitively, if the sampled masking rate \( t \sim \mathcal{U}[0, 1] \) is too low, reconstructing \( \mathbf{x} \) is easy and provides little learning signal. If we mask everything, the optimal reconstruction is given by the per-token marginals of the data distribution, which is easy to learn and again not useful.
We seek noise schedules that minimize the training variance of the diffusion objective and further reduce the perplexity gap. To avoid masking rates that cause high-variance training, we train BD3-LMs under "clipped" masking rates \( t \sim \mathcal{U}[\beta, \omega] \) for \( 0 \leq \beta < \omega \leq 1 \). By reducing the training variance, we improve likelihoods when evaluating under uniformly sampled masking rates.
As the optimal masking rates may differ with the block size \( L' \), we adaptively learn \( \beta, \omega \) during training. In practice, we do so with a grid search at every validation step (which occurs every 5K gradient updates), optimizing \(\min_{\beta, \omega} \text{Var}_{\mathbf{X}, t} \left[ \mathcal{L}_{\text{BD}}(\theta, \beta, \omega; \mathbf{X}) \right] \).
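Here is a sketch of this variance-driven selection, with an illustrative grid of candidate intervals (including the clipped ranges from the table below); `bd3_loss_clipped(batch, beta, omega)` is assumed to compute the training loss with \( t \sim \mathcal{U}[\beta, \omega] \).

```python
import torch

def fit_clipping_interval(bd3_loss_clipped, val_batches,
                          candidates=((0.0, 1.0), (0.3, 0.8), (0.45, 0.95))):
    """Pick the interval [beta, omega] whose loss estimator has the lowest
    empirical variance on held-out batches (run at each validation step)."""
    best, best_var = None, float("inf")
    for beta, omega in candidates:
        losses = torch.stack([bd3_loss_clipped(batch, beta, omega)
                              for batch in val_batches])
        var = losses.var().item()
        if var < best_var:
            best, best_var = (beta, omega), var
    return best
```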
Below, we show that optimizing the noise schedule per block size reduces the variance of the loss estimator and achieves the best perplexities compared to alternative schedules.
| BD3-LMs | Noise schedule | PPL | Var. ELBO |
|---|---|---|---|
| L' = 4 | Linear t ~ U[0, 1] | 30.18 | 23.45 |
| | Clipped t ~ U[0.45, 0.95] | 29.21 | 6.24 |
| | Clipped t ~ U[0.3, 0.8] | 29.38 | 10.33 |
| | Logarithmic | 30.36 | 23.53 |
| L' = 16 | Linear t ~ U[0, 1] | 31.72 | 7.62 |
| | Clipped t ~ U[0.45, 0.95] | 31.42 | 3.60 |
| | Clipped linear t ~ U[0.3, 0.8] | 31.12 | 3.58 |
| | Cosine | 31.41 | 13.00 |
BD3-LMs achieve state-of-the-art likelihoods among diffusion models. As shown below, BD3-LMs interpolate between diffusion and autoregressive likelihoods by tuning the block length \( L' \).
Model | PPL (↓) |
---|---|
AR | 17.54 |
SEDD | ≤ 24.10 |
MDLM | ≤ 22.98 |
BD3-LMs L' = 16 | ≤ 22.27 |
BD3-LMs L' = 8 | ≤ 21.68 |
BD3-LMs L' = 4 | ≤ 20.73 |
One key drawback of many existing diffusion language models is that they cannot generate documents longer than the output context chosen at training time. For example, OpenWebText contains documents of up to 131K tokens, whereas the discrete diffusion model SEDD (Lou et al.) is restricted to generating 1024 tokens. Below, we show that BD3-LMs can generate variable-length documents by decoding an arbitrary number of blocks.
| | Median # tokens | Max # tokens |
|---|---|---|
| OWT train set | 717 | 131K |
| AR | 4008 | 131K |
| SEDD | 1021 | 1024 |
| BD3-LM L' = 16 | 798 | 9982 |
We assess the generation quality of BD3-LMs on variable-length sequences, comparing all methods using the same number of generation steps (NFEs). Below, we measure the generative perplexity of sampled sequences under GPT2-Large. BD3-LMs achieve the best generative perplexities compared to all previous diffusion methods.
| Category | Model | Gen. PPL (↓) | NFEs | Gen. PPL (↓) | NFEs |
|---|---|---|---|---|---|
| Autoregressive | AR | | | | |
| Diffusion | SEDD | | | | |
| | MDLM | | | | |
| Block diffusion | SSD-LM L' = 25 | | | | |
| | BD3-LMs L' = 16 | | | | |
| | L' = 8 | | | | |
| | L' = 4 | | | | |
For MDLM, we use its block-wise decoding technique (which does not feature block diffusion training as in BD3-LMs) with L = 2048. We also compare to SSD-LM (Han et al.), an alternative block-autoregressive (also known as semi-autoregressive) method that performs Gaussian diffusion over word embeddings but cannot perform likelihood estimation. Our discrete approach yields samples with improved generative perplexity using an order of magnitude fewer generation steps.
We presented Block Discrete Denoising Diffusion Language Models (BD3-LMs), a new model class that combines the strengths of autoregressive and diffusion approaches while overcoming their limitations. Block diffusion addresses key drawbacks of existing discrete diffusion models: the quality gap to AR models and their inability to generate arbitrary-length sequences or support KV caching. By doing so, BD3-LMs set a new state of the art among discrete diffusion models. Our work is a promising step toward building powerful diffusion language models that are competitive with standard LLMs, while offering parallel token generation and improved controllability of samples.
@inproceedings{
arriola2025interpolating,
title={Interpolating Autoregressive and Discrete Denoising Diffusion Language Models},
author={Marianne Arriola and Aaron Gokaslan and Justin T Chiu and Jiaqi Han and Zhihan Yang and Zhixuan Qi and Subham Sekhar Sahoo and Volodymyr Kuleshov},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=tyEyYT267x}
}