Rethinking Representation Complexity in Drug–Target Prediction via Supervised Vector Quantization
1.Most drug–target interaction (DTI) models use dense, high-dimensional embeddings from pretrained models, but this study shows that many of those features are redundant or irrelevant. It proposes a new plug-and-play supervised vector quantization (SVQ) module that filters out noise while enhancing performance and interpretability.
2.The SVQ framework uses a vector quantization layer to compress and discretize continuous drug/protein embeddings. Models built on these quantized features outperform existing models such as ConPLex and MolTrans on multiple benchmarks, including BIOSNAP and BindingDB.
3.Interestingly, the best results came from reducing features: over 70% of the original pretrained features were found to be uninformative. Boruta feature selection with a Random Forest kept only a fraction of them (e.g., 4.3% on DAVIS), yet performance increased, highlighting the danger of a "more is better" mindset in deep learning representations.
4.The SVQ module replaces manual feature selection with an end-to-end learnable codebook, making it compatible with modern deep learning pipelines. It discretizes the embedding space, reducing redundancy while preserving discriminative patterns via learnable codewords.
5.Compared to recent models, SVQ achieved top AUPR scores on BIOSNAP (0.928) and BindingDB (0.668), and remained competitive on DAVIS. Even in zero-shot tests with unseen drugs/targets, SVQ maintained robust generalization.
6.Beyond accuracy, SVQ enhances interpretability. Codeword usage patterns reveal domain-specific interactions. Drugs targeting similar protein domains (e.g., kinases, ion channels, immunoglobulins) share similar codeword activation patterns, suggesting the model learns biologically meaningful structure.
7.A simple bag-of-words (BoW) representation based on codeword frequency—without using any semantic embeddings—still preserved the structural clustering of drugs, as confirmed by t-SNE and Mantel tests. This suggests that co-occurrence patterns alone carry significant predictive signal.
8.Even with a randomly initialized, frozen codebook, the model remained trainable and lost little performance (<5% on BIOSNAP), confirming that precise embedding semantics are not essential; what matters is the pattern of codeword usage.
9.This reframes representation learning in DTI: co-occurrence and structure of quantized features may matter more than high-dimensional continuous embeddings. The SVQ model represents a shift toward compact, interpretable, and efficient models in drug discovery.
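To make points 2, 7, and 8 more concrete, here is a minimal toy sketch of the two mechanics the thread describes: snapping each continuous embedding to its nearest codeword, then summarizing a molecule as a bag-of-words histogram over codeword indices. This is not the authors' implementation. The function names (`quantize`, `bag_of_codewords`, `cosine`), the squared Euclidean distance, and the tiny hand-written codebook are all assumptions chosen for illustration, and the supervised, end-to-end training of the codebook is deliberately left out.

```python
import math
from collections import Counter


def nearest_codeword(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance, assumed)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(embeddings, codebook):
    """Map each continuous embedding to a discrete codeword index."""
    return [nearest_codeword(vec, codebook) for vec in embeddings]


def bag_of_codewords(indices, codebook_size):
    """Normalized codeword-frequency histogram; codeword vectors are not used."""
    if not indices:
        return [0.0] * codebook_size
    counts = Counter(indices)
    return [counts.get(k, 0) / len(indices) for k in range(codebook_size)]


def cosine(a, b):
    """Cosine similarity between two histograms (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical frozen codebook with 3 codewords in a 2-D embedding space.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

# Hypothetical per-token embeddings for two drugs.
drug_a = [[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [1.1, 0.9]]
drug_b = [[0.2, 0.1], [1.0, 0.8], [0.9, 1.1], [5.2, 4.9]]

idx_a = quantize(drug_a, codebook)          # [0, 1, 2, 1]
bow_a = bag_of_codewords(idx_a, len(codebook))
bow_b = bag_of_codewords(quantize(drug_b, codebook), len(codebook))

print(idx_a, bow_a, cosine(bow_a, bow_b))
```

Because the histogram depends only on which codewords fire and how often, two drugs can be compared without ever looking at the codeword vectors themselves, which is the intuition behind the frozen random codebook result in point 8.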
💻Code:
github.com/jdcc2098/SVQDTI
📜Paper:
biorxiv.org/content/10.1101/…
#DTI #DeepLearning #DrugDiscovery #ProteinLanguageModel #VectorQuantization #Bioinformatics #InterpretableAI