DPRM: A Plug-in Token-Ordering Module for Diffusion Language Models
1 Token ordering is treated as a first-class control variable in diffusion language models (DLMs): DPRM is a plug-in module that changes only the ordering policy, while keeping the host architecture, denoising objective, supervision, and data pipeline unchanged.
2 Core idea: start from confidence-driven progressive ordering (efficient and aligned with inference), then gradually shift toward reward-aware ordering using online estimates of process reward; the goal is to reduce the myopia of confidence-only ordering (quality–exploration dilemma).
3 Formal target policy: DPRM derives an exact controller via a Doob h-transform of the host Markov chain, yielding a reward-tilted Gibbs “reveal law” over candidate token updates: base proposal q0(a|s) is reweighted by exp(β R*t(a;s)), where R*t is a log-moment of future terminal reward given taking action a.
4 Practical controller: exact Doob guidance is expensive, so DPRM uses (i) shortlist sampling from the base proposal, (ii) Soft-BoN reweighting within the shortlist, and (iii) an online bucketized estimator of process reward indexed by coarse phase (early/mid/late) and confidence bins; a gate interpolates between pure confidence and reward-tilted scores based on warmup time and bucket “readiness” counts.
5 Theory highlights: (a) stagewise Soft-BoN approximation converges to the exact DPRM target at O(1/N) in terminal KL (N = shortlist size); (b) online bucketized estimates track the exact DPRM score at empirical-Bernstein-type rates, up to bias from coarsening/warmup/nonstationarity; (c) under tractable stagewise optimization assumptions, DPRM has a sample-complexity advantage over both random ordering and confidence-only progressive ordering.
6 Natural-language results as matched interventions across the pipeline: DPRM-PUMA improves GSM8K validation mean from 29.34 to 34.27 in diffusion pretraining; DPRM-DMPO improves harder reasoning subsets (e.g., MATH Hard 44.3→47.9, Countdown Hard 29.6→33.4) in post-training; DPRM-Prism improves GSM8K test-time scaling voted accuracy 82.41→83.85, with higher inference cost (more NFE) but unchanged verification-call count.
7 Scientific-domain transfer: ordering remains high-leverage but becomes multi-objective. In protein inverse folding (DPLM-2 Bit), ordering-aware variants improve forward-folding RMSD (35.47→29.43) and TM-score (0.3071→0.3321), while other co-generation metrics show trade-offs. This supports “ordering as a knob” rather than a universally dominating heuristic.
8 Single-cell gene-expression diffusion (DCM) shows a clean gain: token recovery improves 63.97%→75.92% and zero-expression accuracy 78.39%→99.90% under DPRM variants, indicating that reward-aware ordering can strongly affect reconstruction behavior in sparse, high-dimensional masked diffusion.
9 Molecular and DNA diffusion: in GenMol V2, DPRM variants improve fragment-constrained metrics like linker validity (0.142→0.429) and scaffold quality (0.429→0.712), while de novo diversity/uniqueness can shift; in SDPO for DNA regulatory design, DPRM raises HepG2-related reward while preserving more ATAC and k-mer quality than confidence-only progressive ordering, illustrating controllable reward–quality trade-offs via ordering.
💻Code:
github.com/DakeBU/DPRM-DLLM
📜Paper:
arxiv.org/abs/2604.24357
#DiffusionLM #DiscreteDiffusion #TokenOrdering #DoobTransform #Reasoning #ProteinDesign #SingleCell #MoleculeGeneration #DNADesign #MLTheory