Sequence-based prediction of drug–target binding using machine learning, deep learning and ensemble models without 3D structural information
1. The study presents a fully sequence-based drug–target interaction (DTI) pipeline designed to work when reliable 3D protein structures are missing, aiming for docking-comparable discrimination while keeping the feature space interpretable.
2. Core idea: a unified, hand-engineered representation that concatenates (i) protein physicochemical descriptors (e.g., hydrophobicity/charge/composition), (ii) BLOSUM62-derived evolutionary information, (iii) protein 3-gram motif frequencies (local sequence context), and (iv) sequence-like drug encodings derived from canonical SMILES motif frequencies.
3. The framework is intentionally model-agnostic: it evaluates classical ML (Logistic Regression, SVM, Random Forest), deep learning (MLP, CNN), and multiple ensemble approaches (Extra Trees, Gradient Boosting, Histogram-based Gradient Boosting), plus a stacking ensemble.
4. The stacking classifier combines Random Forest, SVM, and Logistic Regression via a meta-learner, leveraging complementary decision boundaries; reported mean ROC-AUC exceeds 0.90, with a maximum AUC of 0.914 under the paper’s protocol.
5. Evaluation emphasizes methodological hygiene: stratified 5-fold cross-validation, with fold-wise preprocessing to reduce leakage risk; scaling and SMOTE are applied only within training folds (validation/test folds remain untouched).
6. SMOTE’s impact is analyzed explicitly and shown to be model-dependent: it often improves minority-class sensitivity/recall, while effects on Accuracy/F1/ROC-AUC can vary across architectures—highlighting why imbalance handling must be reported alongside metrics.
7. Beyond scalar metrics, the paper inspects learning dynamics for deep models (CNN convergence behavior), confusion matrices for representative models, and ROC curves to characterize threshold-independent discrimination across folds.
8. Interpretability is treated as a first-class goal: permutation importance and SHAP analyses identify influential features, with protein-derived features (physicochemical properties and specific 3-gram motifs) frequently dominating—supporting biologically grounded explanations rather than opaque latent embeddings.
9. For orthogonal validation, the authors perform molecular docking (AutoDock Vina) on selected predicted pairs, using PDB structures and/or AlphaFold2 models filtered by confidence (pLDDT/PAE). In a showcased case (EGFR), a high predicted binding probability aligns with favorable docking scores (e.g., around −6.4/−6.2 kcal/mol), which is used as qualitative support.
10. Limitations are acknowledged: there is no independent external test set; stratified CV may still be optimistic if similar proteins/ligands appear across folds; and docking validation is illustrative rather than a dataset-wide quantitative correlation. Proposed future work includes stricter splits (cold-start, sequence-identity, scaffold) and broader docking benchmarks.
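To make the motif-frequency part of point 2 concrete, here is a minimal sketch of overlapping 3-gram frequencies for a protein sequence and a SMILES string, concatenated into one feature vector. It is illustrative, not the authors' code: the function names, the toy vocabularies, and the choice of character-level k-grams for SMILES are assumptions, and the physicochemical and BLOSUM62 blocks are left out.

```python
from collections import Counter


def kgram_frequencies(sequence, k=3):
    """Relative frequency of each overlapping k-gram in a string."""
    grams = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def to_vector(freqs, vocabulary):
    """Project a frequency dict onto a fixed, ordered vocabulary."""
    return [freqs.get(gram, 0.0) for gram in vocabulary]


def pair_features(protein, smiles, protein_vocab, smiles_vocab, k=3):
    """Concatenate protein and drug k-gram vectors for one drug-target pair."""
    protein_part = to_vector(kgram_frequencies(protein, k), protein_vocab)
    drug_part = to_vector(kgram_frequencies(smiles, k), smiles_vocab)
    return protein_part + drug_part


# Hypothetical toy inputs; a real vocabulary would be built from training data.
protein_vocab = ["MKV", "KVL", "AAG"]
smiles_vocab = ["CCO", "C(=", "=O)"]
features = pair_features("MKVLAAGMKV", "CC(=O)O", protein_vocab, smiles_vocab)
```

Fixing the vocabulary order up front keeps every pair's vector aligned column by column, which is what lets permutation importance or SHAP attribute a score to a named motif.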
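The fold-wise hygiene in point 5 can be sketched without any dependencies: build stratified folds, fit the scaler on the training fold only, and apply it unchanged to the held-out fold. This is a rough sketch under stated assumptions, not the paper's implementation; all function names are hypothetical, and the SMOTE step is only marked with a comment showing where training-only oversampling would go.

```python
import random
from statistics import mean, pstdev


def stratified_folds(labels, k=5, seed=0):
    """Deal sample indices into k folds class by class to preserve class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


def fit_scaler(rows):
    """Per-column mean and standard deviation, computed from the given rows only."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return mus, sigmas


def apply_scaler(rows, mus, sigmas):
    """Standardize rows with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def leak_free_splits(X, y, k=5, seed=0):
    """Yield ((X_train, y_train), (X_val, y_val)) with train-only scaling."""
    folds = stratified_folds(y, k, seed)
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        mus, sigmas = fit_scaler([X[i] for i in train_idx])
        X_train = apply_scaler([X[i] for i in train_idx], mus, sigmas)
        y_train = [y[i] for i in train_idx]
        # Oversampling (e.g. SMOTE) would be applied here, to X_train/y_train only.
        X_val = apply_scaler([X[i] for i in val_idx], mus, sigmas)
        y_val = [y[i] for i in val_idx]
        yield (X_train, y_train), (X_val, y_val)


# Toy imbalanced data: 5 positives out of 20 samples.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [1 if i < 5 else 0 for i in range(20)]
splits = list(leak_free_splits(X, y, k=5, seed=0))
```

In practice the same guarantee is usually obtained by wrapping the scaler and classifier in a scikit-learn `Pipeline` and evaluating it with `StratifiedKFold`, so that fitting happens inside each training fold automatically.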
📜Paper:
doi.org/10.1038/s41598-026-5…
#DrugDiscovery #ComputationalBiology #Bioinformatics #MachineLearning #DeepLearning #EnsembleLearning #DTI #Cheminformatics #ExplainableAI #VirtualScreening