LAMBDA: A Prophage Detection Benchmark for Genomic Language Models
1 LAMBDA introduces a bacterial-domain benchmark that asks whether genomic language models (gLMs) learn whole-genome sequence features, using phage vs bacteria discrimination and genome-wide prophage localization as the central stress test.
2 The benchmark is organized into four tiers of increasing realism: (i) probing tasks on frozen embeddings, (ii) fine-tuning for peak performance, (iii) diagnostic tests (GC bias, FPR/FNR asymmetry, PHROG functional strata), and (iv) genome-wide prophage detection by scanning complete bacterial genomes with overlapping windows.
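A minimal sketch of how tier (iv) might tile a genome into overlapping windows. The window size and stride below are illustrative assumptions, not values taken from LAMBDA, and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(genome_length, window=2000, stride=1000):
    """Yield half-open (start, end) coordinates of overlapping windows.

    window and stride are placeholder values chosen for illustration.
    """
    if genome_length <= 0:
        return
    if genome_length <= window:
        # Genome shorter than one window: emit a single truncated window.
        yield (0, genome_length)
        return
    start = 0
    while start + window < genome_length:
        yield (start, start + window)
        start += stride
    # Final window sits flush with the genome end so no bases are skipped.
    yield (genome_length - window, genome_length)


windows = list(sliding_windows(5000))
```

Each window would then be scored by a classifier, giving one score per window along the genome.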
3 A key design choice is measuring representational value directly: LAMBDA compares probes trained on pretrained embeddings against the same architectures with random weights and reports the difference as ΔMCC. Across models, pretraining yields large gains, while non-linear probes add only modest improvements over linear probes, suggesting that strong models make phage/bacteria largely linearly separable in embedding space.
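As a rough illustration, ΔMCC can be read as the MCC of a probe on pretrained embeddings minus the MCC of the same probe on a randomly initialized copy of the model. The sketch below assumes binary labels (1 = phage, 0 = bacteria). The function names and toy predictions are hypothetical, not taken from the LAMBDA code.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = phage)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def delta_mcc(y_true, pred_pretrained, pred_random):
    """Gain attributable to pretraining: MCC(pretrained) - MCC(random init)."""
    return mcc(y_true, pred_pretrained) - mcc(y_true, pred_random)


# Toy predictions, invented for illustration only.
labels = [1, 1, 0, 0]
gain = delta_mcc(labels, [1, 1, 0, 0], [1, 0, 1, 0])
```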
4 LAMBDA evaluates diverse gLMs spanning architectures, scale, tokenization, and training corpora (e.g., ProkBERT, DNABERT-2, Nucleotide Transformer v2, GENERanno, megaDNA, Caduceus, EVO2), enabling controlled conclusions about what matters for bacterial annotation tasks.
5 Results emphasize training data relevance over sheer parameter count: models trained on prokaryotic/phage-heavy corpora (e.g., ProkBERT-mini, GENERanno) perform strongly despite being far smaller than frontier-scale models, while models trained predominantly on human DNA (e.g., DNABERT-2, Caduceus) lag on bacterial prophage tasks.
6 Diagnostic tests show that GC composition explains only a small fraction of performance (near-zero MCC on GC-preserving shuffled sequences), but error modes differ sharply across models: fragment-level false positive rates can be extremely high for some architectures (e.g., a strong “phage” bias), motivating per-model and per-task diagnostics rather than reliance on a single aggregate score.
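A small sketch of the two diagnostics described here. It assumes the GC-preserving control is a simple mononucleotide shuffle, which keeps base composition and therefore GC content exactly; the thread does not say which shuffling scheme LAMBDA uses. All function names are hypothetical.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in a DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_preserving_shuffle(seq, seed=0):
    """Permute bases at random: composition is kept, sequence order is destroyed."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def error_rates(y_true, y_pred):
    """Return (FPR, FNR) for binary labels, with 1 = phage and 0 = bacteria."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


fragment = "ATGCGCGTTAACCGGATATCG"  # toy sequence, not real data
control = gc_preserving_shuffle(fragment)
```

A model whose MCC stays high on such shuffled controls would be relying on composition alone; reporting FPR and FNR separately exposes a bias toward either class that a single aggregate score can hide.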
7 Functional validation uses PHROG categories on phage CDS sequences and shows that sensitivity varies by gene class: “Head & Packaging” and “Tail” genes are easiest across models, while “Integration & Excision” and “No PHROG match” are harder, highlighting where sequence representations capture canonical phage biology and where “viral dark matter” remains challenging.
8 Genome-wide prophage detection is substantially harder than balanced fragment classification: raw window predictions inflate false positives, especially due to “hard negatives” like genomic islands/ICEs and degraded prophage remnants. LAMBDA therefore includes a signal extraction pipeline (per-genome z-score normalization, bidirectional EMA smoothing, clustering/merging, length and score filtering) that improves MCC with minimal recall loss.
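The pipeline steps named above could be sketched roughly as follows. This is an assumption-laden illustration, not LAMBDA's implementation: the bidirectional EMA is taken to be the average of a forward and a backward pass, and every parameter (`alpha`, `threshold`, `max_gap`, `min_windows`, `min_mean`) is a hypothetical placeholder.

```python
import statistics


def zscore(scores):
    """Per-genome normalization: centre and scale one genome's window scores."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sd for s in scores]


def ema(values, alpha):
    """Single-direction exponential moving average."""
    out, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out


def bidirectional_ema(values, alpha=0.3):
    """Average of forward and backward EMA passes (assumed definition)."""
    fwd = ema(values, alpha)
    bwd = ema(values[::-1], alpha)[::-1]
    return [(f + b) / 2 for f, b in zip(fwd, bwd)]


def call_regions(scores, threshold=0.5, max_gap=1, min_windows=3, min_mean=0.5):
    """Return half-open (start, end) window-index ranges called as prophage."""
    smoothed = bidirectional_ema(zscore(scores))
    # Cluster: collect runs of consecutive windows at or above the threshold.
    runs, start = [], None
    for i, s in enumerate(smoothed):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(smoothed)])
    # Merge: join runs separated by at most max_gap sub-threshold windows.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)
    # Filter: drop regions that are too short or too weak on average.
    return [
        (a, b)
        for a, b in merged
        if b - a >= min_windows and statistics.fmean(smoothed[a:b]) >= min_mean
    ]


# Toy window scores: six high-scoring windows inside a flat background.
toy_scores = [0.0] * 10 + [1.0] * 6 + [0.0] * 10
regions = call_regions(toy_scores)
```

In this sketch, normalizing per genome makes the threshold relative to each genome's own background, while the length filter removes isolated single-window spikes, which are one plausible source of the inflated false positives described above.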
9 On 80 bacterial genomes with 386 verified prophage regions, the best gLM (EVO2) reaches region-level MCC ~0.680 after filtering, while curated smaller models remain competitive (e.g., ProkBERT-mini and Nucleotide Transformer v2 ~0.658; GENERanno ~0.648). Traditional tools still lead overall (e.g., geNomad MCC ~0.794; PHASTER ~0.786), clarifying the current gap between gLMs and homology/hybrid pipelines.
10 LAMBDA also evaluates interpretability claims: EVO2’s sparse autoencoder “prophage feature” (f/19746) fires cleanly in some genomes but not others; after clustering it is competitive yet below an EVO2 classifier (MCC ~0.636 vs ~0.680), suggesting prophage signal may be domain-dependent and/or distributed across circuits rather than captured by a single feature.
💻Code:
github.com/leannmlindsey/LAM…
Paper:
biorxiv.org/content/10.64898…
#Genomics #Bioinformatics #ComputationalBiology #MachineLearning #DeepLearning #LanguageModels #Bacteriophage #Prophage #Microbiome #Benchmarking