Geometry-enhanced Protein Language Modeling Enables Discovery of Novel Antibiotic Resistance Genes
1. The study introduces GeoARG, a geometry-enhanced antibiotic resistance gene (ARG) predictor that distills 3D structural constraints into a fast sequence-only model, aiming to detect evolutionarily remote ARGs that homology-based pipelines often miss.
2. Core idea: ARG function is frequently conserved at the level of active-site geometry even when global sequence identity collapses. GeoARG leverages this by training a multimodal teacher (sequence PLM + E(3)-equivariant structure GNN) and transferring that knowledge to a lightweight student that needs sequence input only at inference.
3. Architecture highlights: sequences are embedded with a pretrained protein language model; predicted structures (via ESMFold) are converted into residue graphs; an E(3)-equivariant GNN encodes rotation/translation-invariant geometry; cross-attention fuses sequence and structure; dual-level distillation aligns both logits and internal representations.
4. Practical deployment focus: one unified student pipeline supports four common screening inputs—LSnt, SSnt, LSaa, SSaa—avoiding separate models for long vs short sequences and for nucleotide vs protein inputs (nucleotide handled via six-frame translation and longest ORF selection).
5. Benchmark resource contribution: GeoARG-DB integrates seven public ARG sources, removes SNP-only resistance records, deduplicates at 100% identity, and standardizes labels into 36 resistance classes, yielding 40,624 non-redundant ARG proteins for training/validation/testing.
6. Performance summary: on curated-protein (UniProt) and genome-derived ORF negative backgrounds, GeoARG shows strong binary discrimination (e.g., LSaa accuracy 0.9987, MCC 0.9973; ORF benchmarks MCC > 0.88 and AUROC > 0.97 across input settings) and improves 36-class subtype prediction accuracy (0.8792 vs 0.7380 for ARGNet and 0.6260 for DeepARG).
7. Why geometry matters (ablation): removing structural features or cross-attention reduces performance, with the largest drops in short-fragment settings (SSnt/SSaa), consistent with geometry compensating when sequence context is limited.
8. Specificity stress test: against 8,793 viral proteins (a stringent, evolutionarily distant negative set), GeoARG maintains higher specificity (LSaa 0.9611; SSaa 0.9062) than ARGNet, HMD-ARG, and DeepARG, suggesting reduced spurious calls on non-bacterial proteins.
9. Remote and emerging resistance: for mobilized colistin resistance (mcr) genes, GeoARG achieves high recall on phylogenetically expanded mcr-like sequences, especially on amino-acid inputs where sequence-only baselines degrade under divergence (e.g., expanded recall LSaa 0.9718 vs 0.6665 for ARGNet; SSaa 0.9532 vs 0.6281).
10. Metagenomic discovery: screening unannotated human gut ORFs from GMGC with <25% identity to GeoARG-DB yields 1,485 high-confidence novel ARG candidates (P > 0.8) across six classes (glycopeptide, beta-lactamase, MLS, phenicol, aminoglycoside, tetracycline). Pfam enrichment supports biological plausibility (e.g., CAT, VanS-like domains, beta-lactamase folds, aminoglycoside-modifying enzyme domains).
11. Structural plausibility checks: representative candidates show strong structure-level conservation despite low sequence identity (e.g., a beta-lactamase candidate at 21.11% identity aligns with TM-score 0.88 and motif-level RMSD 0.32 Å; AlphaFold3 co-folding places ampicillin in a canonical pocket with similar pose; MD simulations keep ligand RMSD ~1–3 Å with persistent hydrogen bonds).
12. Interpretability via counterfactuals: alanine substitutions in beta-lactamase catalytic motifs reduce GeoARG confidence in a motif-specific way, with SXXK disruption causing the largest probability drop and pairwise motif disruptions compounding effects—consistent with known catalytic roles rather than generic sequence cues.
13. Efficiency payoff from distillation: the deployable student (ESM2-35M) is ~15.4× faster than the multimodal teacher (ESM2-650M + E(3) GNN) on A100 inference, enabling large-scale metagenomic screening while still benefiting from structure during training.
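To make the "dual-level distillation" in item 3 concrete, here is a minimal sketch of one common way to combine a logit-level term with a representation-level term. The thread does not give the paper's actual loss, so everything here is an assumption for illustration: the function names, the temperature-scaled KL term, the MSE feature term, the `alpha` weighting, and the requirement that student and teacher representations already share a dimension (in practice a learned projection would usually handle that).

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dual_level_distillation_loss(student_logits, teacher_logits,
                                 student_repr, teacher_repr,
                                 temperature=2.0, alpha=0.5):
    """Hypothetical combined loss: softened-logit KL plus feature MSE.

    alpha and temperature are illustrative defaults, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Logit level: the student matches the teacher's softened class distribution.
    logit_term = temperature ** 2 * kl_divergence(p_teacher, p_student)
    # Representation level: the student matches the teacher's internal embedding.
    feature_term = mse(student_repr, teacher_repr)
    return alpha * logit_term + (1 - alpha) * feature_term

loss_same = dual_level_distillation_loss([2.0, 0.5], [2.0, 0.5], [0.1, 0.2], [0.1, 0.2])
loss_diff = dual_level_distillation_loss([0.5, 2.0], [2.0, 0.5], [0.9, -0.3], [0.1, 0.2])
```

A student that already matches the teacher at both levels gets a loss of zero; any mismatch in either the class distribution or the embedding increases it.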
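Item 4 mentions that nucleotide inputs are handled via six-frame translation and longest ORF selection. The sketch below shows one plausible reading of that step using the standard genetic code. It is not taken from the GeoARG code: the function names are hypothetical, and "longest ORF" is assumed here to mean the longest stop-free stretch in any frame (no start codon required, which suits short fragments).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]

def longest_orf(dna):
    """Longest stop-free peptide across all six frames (assumed ORF definition)."""
    segments = (seg for frame in six_frame_translations(dna)
                for seg in frame.split("*"))
    return max(segments, key=len, default="")

fragment = "ATGGCCTAAATGAAACCCGGGTTTTAG"
peptide = longest_orf(fragment)  # comes from the reverse strand for this input
```

For this example fragment the longest stop-free stretch lies on the reverse strand, which is why scanning all six frames matters when the orientation of a read is unknown.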
💻Code:
github.com/XingqiaoLin/GeoAR…
📜Paper:
biorxiv.org/content/10.64898…
#AntimicrobialResistance #AMR #AntibioticResistance #Metagenomics #ProteinLanguageModels #GeometricDeepLearning #KnowledgeDistillation #Bioinformatics #ComputationalBiology #Resistome