A Benchmark for Quantum Chemistry Relaxations via Machine Learning Interatomic Potentials
1. PubChemQCR is the largest publicly available dataset of DFT-based molecular relaxation trajectories, with 3.5 million molecules and over 300 million conformations, including 105 million computed with DFT. Each conformation includes total energy and atomic force labels.
2. The dataset captures full geometry optimization trajectories, not just final structures, which addresses a key gap in previous datasets. This lets machine learning interatomic potentials (MLIPs) learn from both stable and non-equilibrium geometries.
3. PubChemQCR offers broad chemical diversity, covering 25 elements and a wide range of molecular sizes and conformational complexities. It was built from PubChemQC’s raw optimization outputs, which pass through PM3, Hartree–Fock, and DFT stages.
4. Compared to existing datasets like QM9, GEOM, or ANI-1x, PubChemQCR provides significantly more conformational data, better element coverage, and crucial force labels at a high-accuracy DFT level, which makes it uniquely suited for training MLIPs.
5. A curated subset, PubChemQCR-S, contains ~41K DFT relaxation trajectories for efficient model benchmarking. This subset supports rapid prototyping, ablation studies, and hyperparameter tuning.
6. The authors benchmarked nine MLIP models (including SchNet, PaiNN, NequIP, FAENet, and Equiformer) on energy and force prediction tasks using PubChemQCR-S. Equiformer achieved the best overall performance on both energy and force metrics.
7. In geometry optimization tasks, Equiformer outperformed all other models, with an average energy minimization of 70.15%, a chemical accuracy success rate of 23.81%, and a force convergence rate of 19.85%. Most other models struggled, especially with force convergence.
8. The dataset supports supervised pretraining of 3D molecular models with physically grounded energy and force labels, which could benefit downstream property prediction tasks in drug discovery and materials science.
9. The dataset also enables training of generative models for 3D molecular structures. These models can learn to generate low-energy conformations directly from the data, bypassing costly DFT optimization.
10. Limitations include the dataset's near-equilibrium bias (a consequence of DFT relaxation) and inconsistent label quality across optimization stages. Chemical element coverage is also capped at 25 due to DFT method constraints.
11. Despite these limitations, PubChemQCR is a foundational resource for building accurate, transferable, and data-efficient MLIPs. It can accelerate atomistic simulations, geometry optimization, and generative modeling in quantum chemistry.
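For readers new to relaxation trajectories, here is a minimal sketch of what points 1, 2, and 7 describe: a geometry is relaxed step by step, every intermediate frame is stored with its energy and force labels, and the run stops once the largest force component falls below a threshold. Everything in it is an illustrative assumption rather than the paper's pipeline: the toy harmonic bond stands in for DFT or an MLIP, the names `energy_and_forces` and `relax` are hypothetical, and the step size and `fmax` threshold are arbitrary.

```python
import math

def energy_and_forces(positions, k=1.0, r0=1.0):
    """Toy harmonic bond between two atoms (a stand-in for DFT or an MLIP)."""
    (x1, y1, z1), (x2, y2, z2) = positions
    dx, dy, dz = x2 - x1, y2 - y1, z2 - z1
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    energy = 0.5 * k * (r - r0) ** 2
    # Forces are the negative gradient of the energy: F = -dE/dx.
    g = k * (r - r0) / r
    force_on_1 = (g * dx, g * dy, g * dz)
    force_on_2 = (-g * dx, -g * dy, -g * dz)
    return energy, [force_on_1, force_on_2]

def relax(positions, step=0.1, fmax=1e-3, max_steps=500):
    """Steepest-descent relaxation that records every frame of the trajectory."""
    trajectory = []
    for _ in range(max_steps):
        energy, forces = energy_and_forces(positions)
        # Each frame carries the same labels the dataset stores per conformation.
        trajectory.append({"positions": positions, "energy": energy, "forces": forces})
        largest = max(abs(c) for f in forces for c in f)
        if largest < fmax:  # force convergence criterion (assumed threshold)
            break
        # Move every atom a small distance along its force.
        positions = [
            tuple(p + step * f for p, f in zip(pos, force))
            for pos, force in zip(positions, forces)
        ]
    return trajectory

# Start from a stretched bond (1.5 instead of the equilibrium 1.0).
traj = relax([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)])
print(len(traj), "frames; final energy:", traj[-1]["energy"])
```

The early frames of `traj` are the non-equilibrium geometries and the last frame is the relaxed structure. A dataset that kept only the last frame would discard the rest, which is the gap point 2 refers to.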
💻Code:
huggingface.co/divelab
📜Paper:
arxiv.org/abs/2506.23008v1
#QuantumChemistry #ML4Science #DFT #GraphNeuralNetworks #MolecularSimulation #MachineLearning #OpenScience #MolecularModeling