Unified Genomic and Chemical Representations Enable Bidirectional Biosynthetic Gene Cluster and Natural Product Retrieval
1. Liu, Li, Ong et al. present BCCoE, a multimodal retrieval framework that puts biosynthetic gene clusters (BGCs) and natural products into a shared embedding space, enabling both directions of search: BGC→compound and compound→BGC.
2. The key idea is to reuse strong pretrained “foundation” embeddings from each modality, then learn only a lightweight alignment: BiGCARP embeddings for BGC Pfam-domain sequences (256D) and MoLFormer embeddings for compound SMILES (768D), both projected into a 64D co-embedding space for cosine-similarity nearest-neighbor retrieval.
3. Architecture: two modality-specific encoders (same structure, separate weights) that apply (i) a linear projection, (ii) a 2-layer transformer encoder, (iii) pooling and concatenation with the mean of the original embedding sequence, then (iv) batch norm and a 2-layer MLP to output the final co-embedding vectors.
4. Training is metric learning with N-pair loss over batches of paired (BGC, compound) examples from MIBiG; foundation-model embeddings are frozen to reduce overfitting and to preserve general representations. Negatives are implicitly taken from other pairs within the same batch (efficient “in-batch” negatives).
5. Why alignment matters: baselines that do retrieval without cross-modal alignment (KNN and a two-hop KNN-2hop that chains BGC-similarity and compound-similarity) cannot consistently capture genotype–chemotype links, especially when candidate pools include novel items not seen during training.
6. Main quantitative results on MIBiG 4.0 (10-fold CV): for BGC→compound retrieval at top-10, Recall improves from 12.9% (KNN) and 21.9% (KNN-2hop) to 32.9% (BCCoE); for compound→BGC at top-10, BCCoE reaches 65.3% Recall (vs 60.6% KNN-2hop), with very large lift over random guessing at low K.
7. Generalization to unseen product classes (hold out one entire BGC product class during training): performance drops for all methods, but BCCoE remains substantially better, achieving Lift@10 of 17.0 (BGC→compound) and 20.2 (compound→BGC), outperforming KNN-2hop by ~75–89% in lift at top-10.
8. Temporal generalization (train on MIBiG 3.1, evaluate on new links added in MIBiG 4.0): BCCoE improves identification of newly added BGC–compound pairs, e.g., when retrieving compounds from the full MIBiG 4.0 candidate set, top-10 hits rise from 126 (KNN-2hop) to 180 (BCCoE) among 473 new pairs.
9. Robustness across alternative foundation models: when ESM-C is swapped in for BGCs or Uni-Mol2 for compounds, BCCoE remains relatively stable, while KNN-2hop can degrade sharply. The cause is “similarity saturation”: cosine similarities cluster near 1 in the initial embedding spaces, which breaks two-hop score ranking. BCCoE’s aligned space yields a better-behaved similarity distribution.
10. Practical validation beyond MIBiG: on three experimentally validated external BGC–compound pairs previously used in BGC-MAP, BCCoE ranks the true matches much higher in both directions (BGC→compound and compound→BGC), supporting its use for prioritizing candidates in real discovery workflows.
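A minimal sketch of the training objective from point 4 and the cosine-similarity retrieval from point 2, in plain Python. This is an illustration under stated assumptions, not the authors' code. The function names (n_pair_loss, top_k, recall_at_k, lift_at_k) are hypothetical. The loss is the standard multi-class N-pair form, i.e. softmax cross-entropy over the in-batch similarity matrix. The lift formula assumes exactly one true match per query, which may differ from the paper's definition.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity; the small epsilon guards against zero vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12)


def n_pair_loss(bgc_emb, cpd_emb):
    """Multi-class N-pair loss with in-batch negatives.

    Row i of bgc_emb is paired with row i of cpd_emb; every other
    compound in the batch serves as a negative for BGC i.
    """
    total = 0.0
    for i, anchor in enumerate(bgc_emb):
        logits = [dot(anchor, c) for c in cpd_emb]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(bgc_emb)


def top_k(query, candidates, k):
    """Indices of the k candidates most cosine-similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda j: -cosine(query, candidates[j]))
    return order[:k]


def recall_at_k(queries, candidates, true_idx, k):
    """Fraction of queries whose true match appears in the top k."""
    hits = sum(1 for q, t in zip(queries, true_idx)
               if t in top_k(q, candidates, k))
    return hits / len(queries)


def lift_at_k(recall, k, n_candidates):
    # Assumption: one true match per query, so random guessing
    # achieves recall k / n_candidates.
    return recall / (k / n_candidates)


if __name__ == "__main__":
    # Toy co-embeddings (hypothetical values), one row per item.
    bgc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cpd = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    print("loss:", n_pair_loss(bgc, cpd))
    r = recall_at_k(bgc, cpd, [0, 1, 2], k=1)
    print("recall@1:", r, "lift@1:", lift_at_k(r, 1, len(cpd)))
```

Swapping the query and candidate arguments gives the compound→BGC direction, since both modalities live in the same space once aligned.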
💻Code:
zenodo.org/records/18849052
📜Paper:
doi.org/10.1038/s41598-026-4…
#Bioinformatics #ComputationalBiology #NaturalProducts #GenomeMining #BiosyntheticGeneClusters #MultimodalAI #MetricLearning #RepresentationLearning #Cheminformatics