Learning Structure, Energy, and Dynamics: A Survey of Artificial Intelligence for Protein Dynamics
1. The survey maps the protein-dynamics AI landscape into three complementary training signals: learning from structural ensembles/trajectories, learning from physical energy signals (Boltzmann learning), and learning to accelerate or replace components of molecular dynamics (ML force fields, coarse graining, and collective variables).
2. A key theme is the shift from inference-time “make AlphaFold diverse” heuristics to explicit generative modeling of p(x|sequence) for equilibrium ensembles, using diffusion, flow matching, and latent language-modeling approaches that can be trained end-to-end on ensemble data and sampled efficiently.
3. For equilibrium ensemble generation, it highlights how modern methods incorporate stronger priors and broader conditioning: MSA-free sequence-conditioned generators (e.g., latent diffusion with protein language models), temperature/thermodynamic conditioning, and energy-conditioned sampling to target distinct conformational states rather than a single dominant fold.
4. It emphasizes practical failure modes in purely data-driven ensemble learning (limited physical realism, scarcity of diverse dynamic data, and PDB conformational bias) and surveys mitigation strategies such as force/energy guidance, physics-informed objectives (e.g., Fokker–Planck supervision), experimental stability signals, and dataset reweighting via structural clustering.
5. The review extends “dynamics generation” beyond i.i.d. conformers to explicit trajectory models p(X|x1): (i) learned long-timestep transition kernels (MCMC-style), (ii) autoregressive frame prediction with improved temporal scalability (including state-space-model adaptations), and (iii) one-shot “trajectory-as-video” generation that outputs full time-ordered coordinate sequences in a single pass.
6. A central innovation thread is energy-driven learning: Boltzmann generators and related samplers use energies/forces to learn proposals for the Boltzmann distribution, then correct residual bias with self-normalized importance sampling or annealing/SMC-style procedures. Effective sample size (ESS) serves as a key reliability diagnostic.
7. The survey contrasts exact-likelihood normalizing-flow Boltzmann generators (enabling principled reweighting) with likelihood-free diffusion/flow approaches trained from energetic supervision, and discusses the core tradeoff: thermodynamic guarantees on one side versus scalability, symmetry handling, and the computational cost of likelihood/energy evaluations on the other.
8. It also covers “physics-aware adaptation” of pretrained protein generators: post-training alignment and inference-time steering that tilt a base generator toward lower-free-energy or constraint-satisfying samples using energies, forces, projections, or CV/observable-based biases. The aim is to retrofit thermodynamic meaning without retraining from scratch.
9. On the simulation side, it reviews how ML accelerates or upgrades MD via: (i) machine-learning potentials approaching QM fidelity while scaling to large biomolecular systems (often via fragmentation, Δ-learning, and explicit long-range terms), (ii) learned coarse-grained models that approximate potentials of mean force for longer timescales, and (iii) ML-derived collective variables for enhanced sampling using dimensionality reduction, kinetic learning, RL/adaptive sampling, and differentiable generative constraints.
10. The survey closes by curating key datasets (static PDB and distilled AFDB/AFESM; MD trajectory corpora like ATLAS, DynamicPDB, mdCATH, MISATO, DD-13M; and IDP/experimental resources like BMRB and SASBDB) and framing open challenges: scalability, thermodynamic consistency, kinetic fidelity, dataset/force-field bias, and principled integration with experimental constraints.
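To make point 6 concrete, here is a minimal sketch of self-normalized importance sampling with an ESS check. It is not the survey's code or any particular Boltzmann generator. The toy 1D harmonic energy (with kT = 1), the fixed Gaussian standing in for a learned proposal, and all function names are illustrative assumptions.

```python
import math
import random


def snis_weights(log_target, log_proposal):
    """Self-normalized importance weights, w_i proportional to p(x_i) / q(x_i).

    log_target may be unnormalized (e.g. -U(x)/kT); the unknown
    normalization constant cancels when the weights are normalized.
    """
    log_w = [lt - lq for lt, lq in zip(log_target, log_proposal)]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """Kish ESS for normalized weights: 1 / sum_i w_i^2 (between 1 and N)."""
    return 1.0 / sum(w * w for w in weights)


def reweighted_mean(values, weights):
    """Importance-weighted estimate of an observable's ensemble average."""
    return sum(v * w for v, w in zip(values, weights))


# Assumption: toy harmonic well U(x) = x^2 / 2 with kT = 1, so the
# Boltzmann distribution is a standard normal and <x^2> = 1 exactly.
def energy(x):
    return 0.5 * x * x


# Assumption: a deliberately mismatched Gaussian plays the role of the
# learned proposal q(x); a real model would supply samples and log q(x).
MU, SIGMA = 0.5, 1.5


def log_q(x):
    return -0.5 * ((x - MU) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))


rng = random.Random(0)
samples = [rng.gauss(MU, SIGMA) for _ in range(5000)]

weights = snis_weights([-energy(x) for x in samples], [log_q(x) for x in samples])
ess = effective_sample_size(weights)
mean_x2 = reweighted_mean([x * x for x in samples], weights)

print(f"ESS = {ess:.0f} of {len(samples)} samples")
print(f"reweighted <x^2> = {mean_x2:.3f} (exact value for this toy system: 1)")
```

An ESS close to the sample count suggests the proposal already overlaps the target well. An ESS of only a handful of samples means a few weights dominate, so the reweighted averages should not be trusted.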
📜Paper:
arxiv.org/abs/2604.25244
#ProteinDynamics #ComputationalBiology #MolecularDynamics #GenerativeAI #DiffusionModels #FlowMatching #BoltzmannGenerators #MachineLearningPotentials #CoarseGraining #EnhancedSampling