Biology AI Daily

Biology AI Daily

Users
Tweets

Jun 13

Interpretable enzyme function prediction via sparse autoencoder features of ESMC across the microbial protein universe 1. The paper proposes a “features-first” route to enzyme function prediction: instead of training a new deep model, it uses ESMC-6B’s sparse autoencoder (SAE) features as an interpretable semantic signature to predict EC numbers. 2. Key enabler: the ESMC SAE expands layer-60 hidden states into a 16,384-feature codebook with Top-K=64 sparsity, and each feature has an independently interpretable biological concept label (generated/annotated via multi-agent GPT-5), enabling mechanistic explanations alongside predictions. 3. Benchmark setup: 4,868 reviewed microbial SwissProt enzymes (Bacteria Archaea), balanced across 7 EC1 superclasses and spanning 161 EC3 subclasses; proteins are 80–700 aa and have ≥3 EC levels. Protein representations are mean-pooled SAE activations, evaluated with simple linear probes. 4. Main EC3 result (161-way classification, 80/20 split): SAE binary features (just the top-64 active concepts per protein) reach 78.9% top-1 and 88.5% top-5 accuracy, outperforming a 3-mer logistic regression baseline (57.3% top-1) and approaching BLASTp (80.5% top-1). 5. Using richer SAE information improves further: the combined binary activation weights representation reaches 85.6% top-1 and 90.5% top-5, exceeding BLASTp top-1 by 6.4% in this benchmark. 6. Practical advantage over homology transfer: BLASTp returns no hit for 12.6% of test proteins, while SAE features yield predictions for 100% of queries; for the BLASTp no-hit subset, SAE binary still achieves 62.1% top-5 accuracy. 7. “Dark-matter regime” analysis: test proteins are binned by maximum 3-mer Jaccard similarity to training data; in the lowest-similarity bin (<0.20, containing 61% of the test set), SAE top-5 accuracy is 0.656 vs 0.438 for the 3-mer baseline, showing robustness when sequence similarity is minimal. 8. Generalization to unseen enzyme classes: leave-one-EC3-class-out evaluation (60 populous EC3 subclasses held out) trains only an EC1 classifier and tests whether it recovers the correct EC1 superclass for a completely unseen EC3 class; SAE binary achieves 47.7% EC1 recovery (3.3× random; vs 26.6% for sequence features). 9. Interpretability check: the most discriminative SAE features align with known enzymology—e.g., hydrolases linked to α/β-hydrolase catalytic triad/nucleophilic elbow geometry; oxidoreductases to Rossmann NAD(P)H-binding motifs; transferases to P-loop/Walker A phosphate-binding patterns; translocases to multi-helix transmembrane bundle concepts—supporting mechanistic plausibility of the learned “concept” features. 10. Atlas-scale survey: scanning 7.7M ESM Atlas cluster representatives, the authors identify 169,859 “dark enzyme-like” candidates (uncharacterized clusters with enzyme-suggestive Pfam keywords) spanning major microbial phyla, positioning SAE-based signatures as a scalable prioritization tool for experimental enzyme discovery. 💻Code: github.com/YueHuLab/esmc-sae… 📜Paper: arxiv.org/abs/2606.12209 #ProteinLanguageModels #EnzymeDiscovery #ECNumber #Interpretability #SparseAutoencoders #Microbiome #Metagenomics #ComputationalBiology #ESMC #ESMAtlas

2,229

GitHub - Bitbol-Lab/ProteomeLM: ProteomeLM: A proteome-scale language model allowing fast predict...

Scaling antibody language models improves structure aware representation for antibody engineering