Embedded Morgan Fingerprints for more efficient molecular property predictions with machine learning
1. This work introduces Embedded Morgan Fingerprints (eMFP), a compact representation of traditional Morgan Fingerprints (MFP) designed to improve both speed and accuracy in molecular property prediction using machine learning.
2. eMFP reduces the dimensionality of high-bit MFP vectors by applying an embedding technique inspired by one-hot encoding compression. This helps mitigate overfitting and accelerates training without compromising molecular structural information.
3. Compared to MFP, eMFP achieves faster training and better prediction accuracy across various models including Random Forest, MLP, K-Neighbors Regressor, Gradient Boosted Trees, and Deep Neural Networks.
4. Across three datasets (RedDB, NFA, and QM9), eMFP outperformed standard MFP in regression tasks predicting HOMO-LUMO gaps, especially on large datasets like QM9.
5. The optimal compression factors for eMFP were q = 16 and q = 32 for small/medium datasets, and q = 16 and q = 64 for large datasets, striking a balance between compactness and model performance.
6. eMFP retains the essential features of MFP, as confirmed through Principal Component Analysis (PCA) and kernel density estimates (KDE) of predictions. The structural integrity of the encoded data is preserved even at high compression.
7. Regression models trained with eMFP achieved higher R² values and narrower residual distributions (smaller FWHM, full width at half maximum), indicating improved generalization and more consistent prediction quality.
8. Training with eMFP was significantly faster than with MFP, by up to several orders of magnitude, making it a practical choice under computational constraints and in large-scale modeling tasks.
9. eMFP enabled more extensive hyperparameter optimization within fixed time limits, leading to better model tuning. In contrast, MFP often failed to complete full optimization runs on large datasets like QM9.
10. This work highlights the potential of embedding techniques not just for cheminformatics, but also as a general strategy for compressing high-dimensional categorical data in ML workflows.
11. Overall, eMFP offers a more efficient, scalable, and often superior alternative to MFP, especially valuable for tasks requiring large datasets or fast model iteration.
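To make the idea of a compression factor q from point 2 concrete, here is a minimal sketch of one way a bit vector could be shortened by a factor of q. It is an illustrative assumption, not the scheme from the paper or the repository: each block of q bits is packed into a single integer, so an n-bit vector becomes n/q values. The function names `compress_fingerprint` and `expand_fingerprint` are hypothetical.

```python
def compress_fingerprint(bits, q):
    """Pack each block of q bits into one integer.

    Hypothetical scheme for illustration only; the eMFP paper may use a
    different embedding.
    """
    if q <= 0 or len(bits) % q != 0:
        raise ValueError("fingerprint length must be a positive multiple of q")
    compressed = []
    for start in range(0, len(bits), q):
        value = 0
        for bit in bits[start:start + q]:
            # Read the block as a big-endian binary number.
            value = (value << 1) | bit
        compressed.append(value)
    return compressed


def expand_fingerprint(values, q):
    """Invert compress_fingerprint, recovering the original bit list."""
    bits = []
    for value in values:
        bits.extend((value >> shift) & 1 for shift in range(q - 1, -1, -1))
    return bits


# Toy 16-bit "fingerprint"; real Morgan fingerprints are commonly 1024 or 2048 bits.
toy_bits = [0, 0, 0, 1,  1, 0, 1, 0,  0, 0, 0, 0,  1, 1, 1, 1]
toy_compressed = compress_fingerprint(toy_bits, q=4)  # 16 bits -> 4 integers
```

In this sketch the packing is reversible, so no bit information is lost while the feature count drops by a factor of q. Whether eMFP itself is lossless in this sense is not stated in the thread.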
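Point 7 compares models by R² and by the FWHM of the residual distribution. The sketch below shows how both could be computed for a set of predictions. The thread does not say how FWHM was obtained, so this example assumes roughly Gaussian residuals and uses the relation FWHM = 2·sqrt(2·ln 2)·σ; the paper may instead read the width off a KDE curve. The function names and the numbers are made up for illustration.

```python
import math
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def gaussian_fwhm(residuals):
    """FWHM of the residuals, ASSUMING they are approximately Gaussian."""
    sigma = statistics.pstdev(residuals)
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma


# Made-up HOMO-LUMO gap values (eV), not data from the paper.
gap_true = [1.0, 2.0, 3.0, 4.0]
gap_pred = [1.1, 1.9, 3.2, 3.8]

residuals = [t - p for t, p in zip(gap_true, gap_pred)]
r2 = r_squared(gap_true, gap_pred)
fwhm = gaussian_fwhm(residuals)
```

A higher `r2` together with a smaller `fwhm` corresponds to the "higher R², narrower residuals" pattern described for eMFP.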
💻Code:
github.com/MMLabCodes/eMFP
📜Paper:
doi.org/10.26434/chemrxiv-20…
#MachineLearning #Cheminformatics #MolecularML #Fingerprinting #DimensionalityReduction #DeepLearning #QM9 #OpenSource #GraphRepresentation #ComputationalChemistry