Symmetric Self-play Online Preference Optimization for Protein Inverse Folding
1. The paper argues that multi-objective inverse folding shouldn’t be forced into a single scalar reward: structural objectives (e.g., self-consistency vs prediction confidence) are only partially aligned, so a single-policy optimizer tends to follow a dominant direction and miss alternative high-quality designs.
2. It introduces SSP (Symmetric Self-play Preference Optimization), an online preference-optimization framework that keeps two separate policies: one optimized for structural self-consistency (Rsc) and the other for predicted confidence (Rpred), plus an EMA-updated reference model for stabilization.
3. Key mechanism: both policies (and the reference) sample sequences for the same backbone, merge them into a shared candidate pool, refold with ESMFold, filter low-confidence samples, then build preference pairs for each objective from the same pool—creating implicit cross-policy competition and comparison without collapsing objectives.
4. Rewards are explicitly decoupled: Rsc combines scTM and RMSD-based terms after aligning predicted structures to the target backbone; Rpred averages pLDDT and pTM from structure prediction. Training uses DPO-style preference loss plus an SFT term on preferred samples.
5. To prevent policy collapse and encourage complementary exploration, SSP adds (i) a Jensen–Shannon divergence regularizer between the two policies and (ii) an entropy bonus, aiming to cover different regions of the Pareto frontier rather than converging to the same solution.
6. SSP is shown to be architecture-agnostic: implemented on ProteinMPNN (full fine-tuning), ESM-IF1 and ESM3 (LoRA fine-tuning). After training, it produces a single deployable model via merging: task-vector merging for full-parameter models, and weighted LoRA adapter merging for parameter-efficient setups.
7. On native-backbone benchmarks (CATH4.2/4.3), SSP variants consistently improve structure prediction confidence (pTM/pLDDT), self-consistency (scTM, RMSDs), and related metrics over base models and several RL/DPO baselines, indicating that the dual-policy setup yields gains beyond standard single-policy preference optimization.
8. Generalization is tested on CAMEO43 (targets with max TM-score < 0.5 vs training set). SSP improves pTM/pLDDT and self-consistency vs baselines, supporting the claim that decoupled objectives help in harder, lower-similarity regimes rather than only on in-distribution backbones.
9. Transfer to de novo binder backbones is emphasized as a “real-world” proxy: on BoltzGen-419 (DNA/RNA/peptide binders) and PXDesign-PPI226 (protein binders with specified hotspots), the merged SSP-ESM3 model leads across pTM/scTM and interface confidence (ipTM), and achieves strong design success rates—suggesting robustness across binder types and backbone generators.
10. The paper provides interpretability evidence that the two objectives truly induce different optimization directions: white-box analysis of ESM3 LoRA updates shows low subspace overlap (SVD-based) and near-orthogonal update directions (cosine similarity near 0) across many layers, with only mild alignment in deeper layers—supporting the “partially aligned objectives” hypothesis.
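A rough sketch of points 3 to 5 (shared pool, decoupled rewards, DPO-style loss, JS divergence), using only the Python standard library. This is not the authors' code: the `Candidate` fields, the equal weights and `exp(-rmsd / scale)` form inside `r_sc`, the pLDDT filter threshold, and the best-vs-worst pairing rule are all assumptions made for illustration. The thread only says that Rsc combines scTM and RMSD-based terms, that Rpred averages pLDDT and pTM, and that low-confidence samples are filtered.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One refolded sequence from the shared pool (hypothetical container)."""
    seq: str
    sctm: float   # self-consistency TM-score, in [0, 1]
    rmsd: float   # RMSD to the target backbone after alignment, in angstroms
    plddt: float  # prediction confidence, rescaled to [0, 1]
    ptm: float    # predicted TM-score, in [0, 1]


def r_sc(c: Candidate, rmsd_scale: float = 2.0) -> float:
    # Assumed form: equal-weight mix of scTM and a decaying RMSD term.
    return 0.5 * c.sctm + 0.5 * math.exp(-c.rmsd / rmsd_scale)


def r_pred(c: Candidate) -> float:
    # Average of pLDDT and pTM, as described in point 4.
    return 0.5 * (c.plddt + c.ptm)


def build_pair(pool, reward, min_plddt: float = 0.7):
    """Filter low-confidence samples, then pair best vs worst under `reward`.

    The threshold and the best-vs-worst rule are assumptions.
    Returns (preferred, rejected), or None if fewer than two samples survive.
    """
    kept = [c for c in pool if c.plddt >= min_plddt]
    if len(kept) < 2:
        return None
    ranked = sorted(kept, key=reward)
    return ranked[-1], ranked[0]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence (natural log) between two categorical distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy pool: the two objectives disagree on which surviving candidate is better.
pool = [
    Candidate("AAA", sctm=0.90, rmsd=1.0, plddt=0.75, ptm=0.70),
    Candidate("BBB", sctm=0.70, rmsd=2.0, plddt=0.95, ptm=0.90),
    Candidate("CCC", sctm=0.95, rmsd=0.5, plddt=0.40, ptm=0.50),  # filtered out
]
sc_pair = build_pair(pool, r_sc)      # pair used to train the Rsc policy
pred_pair = build_pair(pool, r_pred)  # pair used to train the Rpred policy
```

In this toy pool the same two surviving candidates end up ordered in opposite directions by the two rewards, which is the situation the thread describes as partially aligned objectives: each policy gets its own preference signal from one shared pool instead of a single blended score.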
💻Code:
github.com/wwzll123/SSP
📜Paper:
biorxiv.org/content/10.64898…
#ProteinDesign #InverseFolding #ReinforcementLearning #PreferenceOptimization #MultiObjectiveOptimization #ComputationalBiology #ESM3 #ProteinMPNN #LoRA #SelfPlay