UBio-MolFM: A Universal Molecular Foundation Model for Bio-Systems
1. UBio-MolFM targets the “scale–accuracy gap” in biomolecular simulation: it aims to deliver ab initio-like DFT fidelity while scaling to solvated, heterogeneous bio-systems that are too large for standard QM and too complex for fixed-form classical force fields.
2. The framework is built on three coupled pillars: (i) UBio-Mol26, a bio-specific multi-fidelity dataset up to ~1,200 atoms; (ii) E2Former-V2, a linear-scaling SO(3)-equivariant transformer optimized for large systems; (iii) a three-stage curriculum that enforces energy–force consistency and handles heterogeneous reference levels.
3. Data innovation: UBio-Mol26 contains ~17M configurations and is designed explicitly for biology (proteins, nucleic acids, lipids, drug-like molecules, and cross-modal interactions) in explicit solvent, with element coverage including biologically important ions/metals. It complements small-molecule-centric datasets by emphasizing macromolecular chemical environments (e.g., enriched amide and methylene motifs).
4. A “Two-Pronged Strategy” is used to build UBio-Mol26: bottom-up enumeration of biochemical building blocks (e.g., exhaustive tripeptides) plus top-down sampling of native protein environments by extracting residue-centered solvated clusters from AlphaFold DB structures, with chemical capping to preserve realism.
5. Multi-fidelity QM labeling is treated as a first-class design choice to make large-system DFT feasible: ωB97M-D3 is used (cost-reduced vs VV10 variants), a mixed basis strategy improves SCF convergence, and a large def2-SVP subset enables ~10× more data at a small fraction of compute, while retaining a higher-fidelity def2-TZVPD subset.
6. Architecture innovation: E2Former-V2 combines (a) node-centric factorization to reduce edge materialization, (b) Long–Short Range (LSR) modeling to capture non-local physics without fully connected atomic graphs, and (c) Equivariant Axis-Aligned Sparsification (EAAS) that reduces dense SO(3) tensor products into sparse operations via axis-aligned frames while preserving exact equivariance.
7. Systems-level innovation: a fused “on-the-fly” equivariant attention kernel (Triton) computes attention with online softmax and streaming reductions, avoiding storing per-edge attention tensors. This is positioned as a practical route to improved memory locality and throughput on large atom counts.
8. Training innovation: a Three-Stage Curriculum Learning protocol. Stage 1 initializes on OMol25 with separate energy and force heads (fast, because forces are predicted directly rather than obtained through autograd). Stage 2 enforces conservative forces via F = −∇E. Stage 3 mixes OMol25 with UBio-Mol26 using dual heads (SVP vs high-fidelity), dataset balancing (8:1:1), filtering for compatibility, and force-only supervision for subsets with systematic energy offsets.
9. Large-system OOD generalization is explicitly tested beyond the training size cap: test systems are ~1,300–1,500 atoms (proteins/DNA/RNA optimization and solvated-protein MD clusters), with DFT references computed using GPU4PySCF. UBio-MolFM Stage 3 substantially improves protein and RNA force/relative-energy accuracy versus general-purpose baselines (MACE-OMol, UMA-S-1p1), while highlighting a remaining weakness: DNA temporal energy stability (∆E) can regress, motivating targeted DNA data expansion.
10. Downstream MD fidelity is evaluated on macroscopic/structural observables: liquid water O–O RDF matches experimental structure; 0.15 M NaCl shows realistic ion hydration peaks and coordination numbers; Cyclosporine A maintains solvent-dependent open (water) vs closed (vacuum) conformations via H-bond competition; RNA (1L2X) Mg²⁺ binding geometry is reproduced with more realistic Mg–O and Mg–O–P distributions than Amber99 OL3 and the tested ML baseline.
11. Efficiency results (single H100, conservative-force setting): UBio-MolFM (S3, 24M) reports roughly 3.6–3.8× the throughput of UMA-S in the cited large-system cases (e.g., 1K atoms: 61 vs 16 steps/s; 10K: 6.1 vs 1.6; 50K: 0.72 vs 0.20). The authors also note memory limits when long-range interactions are enabled at extreme sizes: UBio-MolFM runs out of memory at 100K atoms in the reported setting.
12. Release plan and resources: authors describe an open-science release including pretrained weights, an inference engine, and a representative dataset subset. A public protein-focused subset (UBio-Protein26 5M) is provided for benchmarking, alongside code and model checkpoints intended to lower barriers for QM-accurate biomolecular simulation workflows.
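A rough illustration of the online-softmax idea from point 7, in plain Python: attention weights are accumulated in a single pass with a running maximum, so no per-edge weight list is ever stored. This is only a sketch, not the paper's Triton kernel. The function names are hypothetical, and the values are plain scalars here, whereas the real kernel operates on equivariant feature tensors.

```python
import math


def streaming_attention(scores, values):
    """One-pass softmax-weighted average of scalar `values` by `scores`.

    Hypothetical helper: keeps only a running max, a running denominator,
    and a running weighted sum, never the full list of attention weights.
    """
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale what was accumulated so far to the new reference maximum.
        scale = 0.0 if running_max == float("-inf") else math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        denom = denom * scale + w
        acc = acc * scale + w * v
        running_max = new_max
    return acc / denom


def reference_attention(scores, values):
    """Two-pass softmax average that materializes every weight (for comparison)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Both functions should agree up to floating-point rounding; the streaming version just trades the stored weight list for a rescaling step whenever a larger score appears.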
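To make the "conservative forces" requirement in point 8 concrete, here is a minimal sketch in which forces are derived from a single scalar energy as F = −∇E. It is an illustration under stated assumptions, not the authors' training code: the energy is a toy harmonic pair potential standing in for a learned model, the gradient is taken by central finite differences rather than autograd, and all names are hypothetical.

```python
import math


def toy_energy(positions, k=1.0, r0=1.0):
    """Toy stand-in for a learned energy: harmonic terms over all atom pairs."""
    energy = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            energy += 0.5 * k * (r - r0) ** 2
    return energy


def conservative_forces(energy_fn, positions, h=1e-5):
    """Estimate F = -dE/dx for every coordinate by central differences.

    A real model would use automatic differentiation; finite differences
    are used here only to keep the sketch dependency-free.
    """
    forces = []
    for i, atom in enumerate(positions):
        f_atom = []
        for d in range(len(atom)):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][d] += h
            minus[i][d] -= h
            f_atom.append(-(energy_fn(plus) - energy_fn(minus)) / (2 * h))
        forces.append(f_atom)
    return forces


# Two atoms stretched past the rest length r0 = 1.0 along x.
example_positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
example_forces = conservative_forces(toy_energy, example_positions)
```

Because every force component comes from the same energy function, the forces on the two atoms are equal and opposite and pull the pair back toward the rest length. A separate force head, as in Stage 1, carries no such guarantee.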
💻Code:
github.com/IQuestLab/UBio-Mo…
📜Paper:
arxiv.org/abs/2602.17709
#ComputationalBiology #MolecularDynamics #MachineLearning #EquivariantNetworks #ForceFields #QuantumChemistry #FoundationModels #ProteinSimulations #RNADynamics #ScientificMachineLearning