A Triple-Modal Contrastive Learning Framework with Sequence, Graph, and 3D Features for Drug–Target Interaction Prediction
1 TriMod-DTI is presented as a triple-modal DTI framework that jointly models 1D sequences, 2D graphs, and 3D structures for both drugs and proteins, aiming to learn universal yet complementary representations rather than relying on single- or bi-modal fusion.
2 The paper motivates tri-modality with a cosine-similarity analysis of embeddings on GPCR: similarities across the 1D/2D/3D embeddings are low (mostly within [-0.25, 0.25]), which suggests the modalities are strongly complementary and that explicit cross-modal modeling is needed.
3 For drugs, TriMod-DTI encodes: (i) SMILES sequences segmented by FCS (frequent consecutive subsequence mining) and processed by a Transformer; (ii) 2D molecular graphs built with RDKit, with 75-dimensional atom features, encoded by a GCN; (iii) 3D molecular graphs built from SDF coordinates, with edges between atoms less than 4.5 Å apart, encoded by a GVP-GNN that integrates scalar and vector geometric features.
4 For proteins, TriMod-DTI encodes: (i) amino-acid sequences, segmented by FCS and processed by a Transformer; (ii) binding-site pocket graphs extracted from OmegaFold-predicted structures, with pockets detected by a previously published method and encoded by TAGCN with attention pooling to give pocket-aware embeddings; (iii) a residue-level 3D structural graph (Cα atoms as nodes; edges from an 8 Å neighbor search) encoded by a GCN.
5 A core methodological piece is triple-modal cross-modal contrastive learning (inspired by CLIP-style alignment): embeddings of the same entity (drug or protein) across modalities form positive pairs (1D–2D, 2D–3D, 1D–3D), while other entities in-batch form negatives, aligning modalities to reduce distribution mismatch before fusion.
6 After alignment, the model concatenates all six embeddings (d1⊕d2⊕d3⊕t1⊕t2⊕t3) and uses an MLP classifier for binary DTI prediction; the total objective combines cross-entropy with separate contrastive losses for drugs and targets weighted by hyperparameters.
7 On three benchmarks (Human, GPCR, DrugBank), TriMod-DTI reports consistent improvements in AUC, and often in AUPR and Precision, over baselines spanning sequence-only and sequence-graph methods. On GPCR it improves AUPR and Precision over a strong multi-attention baseline; on DrugBank it achieves the best AUC and Precision but a lower AUPR, which the authors attribute to class imbalance.
8 Ablations indicate the full tri-modal contrastive objective matters: removing contrastive learning or any cross-modal component degrades performance; the full contrastive setup is reported to improve over a non-contrastive variant (e.g., 1.1% AUC and 2.0% AUPR in their summary).
9 Modality-only analysis suggests sequence contributes most, graph next, and 3D alone is weakest in their setup; the authors argue 3D still adds complementary local spatial context when combined, and note a limitation that their 3D drug encoder may omit key chemical attributes (e.g., atom types/charges), leaving room for improved geometric/chemical featurization.
10 A case study ranks candidate targets for Verapamil and reports literature support for 5 of the top-10 predictions; docking for a top-ranked predicted target (Glucose-6-phosphate isomerase 2) suggests plausible hydrogen-bonding interactions in the pocket, illustrating potential utility for hypothesis generation.
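The distance-based graph construction in points 3 and 4 can be illustrated with a small NumPy sketch. This is not the authors' code: the function name `radius_edges` is hypothetical, and treating the 8 Å protein neighbor search as a plain radius cutoff (rather than, say, k-nearest neighbors) is an assumption.

```python
import numpy as np

def radius_edges(coords, cutoff):
    """Connect every pair of distinct points closer than `cutoff` (in Å).

    coords: (n, 3) array-like of atom or C-alpha coordinates.
    Returns a (2, n_edges) array of directed edges and their lengths.
    Assumption: a strict '<' threshold, matching the '<4.5 Å' in the text.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting: (n, 1, 3) - (1, n, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep close pairs, excluding self-loops on the diagonal.
    mask = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst]), dist[src, dst]

# Three points on a line, 3 Å and 7 Å apart.
points = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
drug_edges, _ = radius_edges(points, cutoff=4.5)     # drug-style 4.5 Å cutoff
protein_edges, _ = radius_edges(points, cutoff=8.0)  # protein-style 8 Å cutoff
```

With the 4.5 Å cutoff only the first two points are linked; the 8 Å cutoff also links the second and third. The dense distance matrix is fine for small molecules and pockets, while large proteins would likely call for a spatial index instead.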
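The cross-modal alignment in point 5 can be sketched as a CLIP-style symmetric InfoNCE loss summed over the three modality pairs. The summary does not give the exact formulation, so the L2 normalization, the temperature of 0.07, the symmetric averaging, and the unweighted sum over pairs are all assumptions, as are the function names.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def pair_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two modality views of one batch.

    a, b: (batch, dim) arrays. Row i of `a` and row i of `b` embed the same
    entity (positive pair); all other rows act as in-batch negatives.
    The temperature value is an assumed default, not taken from the paper.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # scaled cosine similarities, (batch, batch)
    idx = np.arange(len(a))
    loss_ab = -_log_softmax(logits, axis=1)[idx, idx].mean()  # a retrieves b
    loss_ba = -_log_softmax(logits, axis=0)[idx, idx].mean()  # b retrieves a
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_contrastive_loss(e1, e2, e3, temperature=0.07):
    # The three pairs named in the text: 1D-2D, 2D-3D, 1D-3D.
    return (pair_contrastive_loss(e1, e2, temperature)
            + pair_contrastive_loss(e2, e3, temperature)
            + pair_contrastive_loss(e1, e3, temperature))
```

The same function would be applied once to the three drug embeddings and once to the three protein embeddings, giving the two contrastive terms mentioned in point 6.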
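The fusion and training objective in point 6 can be sketched as follows. The summary only says that six embeddings are concatenated, passed to an MLP, and trained with cross-entropy plus weighted contrastive terms; the single hidden layer, the additive form of the total loss, and the weight names `alpha` and `beta` (with placeholder values) are assumptions for illustration.

```python
import numpy as np

def fuse(d1, d2, d3, t1, t2, t3):
    # d1 ⊕ d2 ⊕ d3 ⊕ t1 ⊕ t2 ⊕ t3: concatenate along the feature axis.
    return np.concatenate([d1, d2, d3, t1, t2, t3], axis=1)

def mlp_logits(x, w1, b1, w2, b2):
    # Assumed architecture: one hidden ReLU layer, then two output logits
    # (no interaction / interaction).
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true class, computed stably.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(ce, drug_contrastive, target_contrastive, alpha=0.1, beta=0.1):
    # Assumed form: L = L_CE + alpha * L_drug + beta * L_target.
    # alpha and beta are hypothetical names with placeholder values.
    return ce + alpha * drug_contrastive + beta * target_contrastive
```

For a batch of 4 pairs with 8-dimensional embeddings per modality, `fuse` returns a (4, 48) matrix, which sets the input width of the first MLP layer.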
💻Code:
github.com/klez1/TriMod-DTI
📜Paper:
arxiv.org/abs/2605.29926
#DrugDiscovery #DTI #MachineLearning #DeepLearning #MultimodalLearning #ContrastiveLearning #GeometricDeepLearning #Cheminformatics #Bioinformatics #ComputationalBiology