BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models
1. BOOM is the first systematic benchmark focused on evaluating the out-of-distribution (OOD) generalization performance of molecular property prediction models, a critical challenge for enabling ML-guided discovery of novel molecules.
2. The authors assess 12 ML models across 10 molecular properties, covering over 140 model-task combinations, and find that even top models exhibit up to 3x higher error on OOD data than on in-distribution (ID) test sets.
3. No current model shows robust OOD performance across all tasks. MACE leads in OOD performance on 5 of 10 tasks, while ET dominates ID tasks, indicating that strong in-distribution accuracy does not guarantee generalization.
4. Pretraining strategies like masked language modeling (MLM) significantly improve ID performance but fail to improve OOD performance and sometimes degrade it, highlighting a key limitation of current chemical foundation models.
5. The benchmark defines OOD splits based on the tails of molecular property distributions, aligning with real-world discovery goals where desirable molecules often lie outside known distributions.
6. 3D-aware models, especially those with E(3)-equivariance like EGNN and MACE, outperform SMILES-based transformers in OOD settings. Representation choice is thus more critical than scale for extrapolation.
7. Hyperparameter tuning targeting OOD performance offers some benefit, particularly for simple properties like density or heat of formation, but is not sufficient to close the generalization gap.
8. Augmenting the training set with a small number of OOD molecules substantially improves generalization on 7 of 8 tasks tested, suggesting that even modest exposure to rare examples helps overcome distribution shift.
9. ModernBERT, though a transformer model, incorporates architecture changes that improve OOD performance on tasks like heat of formation (HoF) and heat capacity (Cv), narrowing the gap with graph-based models and showing promise for LLM-style scalability.
10. The study identifies specific property types (e.g., dipole moment, HOMO, LUMO) as persistent weak points for OOD prediction, likely due to the absence of explicit electronic structure features in most models.
11. BOOM provides an open-source benchmark, dataset, and codebase to standardize OOD evaluation and accelerate the development of chemically generalizable machine learning models.
12. This work positions OOD generalization, not just ID accuracy, as a new frontier for chemical ML, essential for reliable property extrapolation and robust molecular discovery.
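To make points 2 and 5 concrete, here is a minimal sketch of a tail-based OOD split and an OOD/ID error ratio. It is not the paper's code: the function names (`tail_split`, `ood_id_error_ratio`), the simple rank-based cutoff, the 10% tail fraction, and the synthetic Gaussian "property" values are all assumptions made for illustration. The summary above does not describe BOOM's exact splitting procedure.

```python
import random
import statistics


def tail_split(values, tail_fraction=0.1):
    """Split indices into (id_indices, ood_indices) by property value.

    Assumed, simplified criterion: the lowest and highest
    tail_fraction / 2 of the sorted values are treated as OOD.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    k = int(n * tail_fraction / 2)  # number of molecules per tail
    ood_indices = order[:k] + order[n - k:]
    id_indices = order[k:n - k]
    return id_indices, ood_indices


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def ood_id_error_ratio(id_true, id_pred, ood_true, ood_pred):
    """Ratio of OOD error to ID error; values above 1 mean worse OOD error."""
    return (mean_absolute_error(ood_true, ood_pred)
            / mean_absolute_error(id_true, id_pred))


# Synthetic stand-in for one molecular property (hypothetical data).
rng = random.Random(0)
property_values = [rng.gauss(0.0, 1.0) for _ in range(1000)]

id_idx, ood_idx = tail_split(property_values, tail_fraction=0.1)
id_true = [property_values[i] for i in id_idx]
ood_true = [property_values[i] for i in ood_idx]

# Toy "model": always predicts the mean of the ID values, so it cannot
# extrapolate into the tails at all.
baseline = statistics.fmean(id_true)
id_pred = [baseline] * len(id_true)
ood_pred = [baseline] * len(ood_true)

ratio = ood_id_error_ratio(id_true, id_pred, ood_true, ood_pred)
print(f"ID size: {len(id_idx)}, OOD size: {len(ood_idx)}")
print(f"OOD/ID error ratio: {ratio:.2f}")
```

The constant baseline is deliberately weak: it only shows how the ratio is computed. In a real evaluation the predictions would come from a trained model, and the ID error would be measured on a held-out ID test set rather than on the data used to fit the model.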
📜Paper:
arxiv.org/abs/2505.01912
#Chemoinformatics #OutOfDistribution #MachineLearning #MolecularDesign #GraphNeuralNetworks #MolecularProperty #Benchmarking #ML4Science #GNN #SMILES #Pretraining #DataAugmentation #BOOM