Protein Language Models Encode Evolutionary Grammar but Conflate Topological and Thermodynamic Phases
1. Wang et al. probe what a sequence-only protein language model (ESM-2 3B) actually encodes by stress-testing it on key “Anfinsen exceptions”: intrinsically disordered proteins (IDPs), fold-switching proteins, and knotted proteins—cases where sequence-to-structure is not a single static mapping.
2. Core result: ESM-2 largely discards microscopic 3D backbone geometry during embedding formation, and instead builds a macroscopic “sequence grammar manifold” shaped by evolutionary statistics and physicochemical composition—useful for separating biological from unphysical sequences, but weak for topology/phase distinctions.
3. To test microscopic geometric awareness, the study uses Hasimoto integrability error E[n], a differential-geometric order parameter tied to backbone twisting/folding symmetry breaking. Residue-level correlations between embedding distances and E[n] are negligible (overall Spearman ρ ≈ 0.105; R² ≈ 0.015), arguing against atomic-detail geometry being represented in the latent space.
4. Global latent structure: PCA of 11,068 proteins reveals a horseshoe-shaped manifold. Random sequences form a clearly isolated cluster (Silhouette ≈ 0.344 in 50D PCA; per-sample mean ≈ 0.566), indicating strong sensitivity to “evolutionary plausibility” of sequences.
5. The main manifold axes map to composition more than geometry: PC2 correlates with hydropathy (GRAVY), pI, and especially aromaticity (ρ ≈ 0.364), creating hydrophilic–hydrophobic gradients. SCOP classes show partial ordering, consistent with statistical secondary-structure preferences rather than explicit coordinate encoding.
6. Key limitation: “topological aliasing.” IDPs, knotted proteins, and fold-switching proteins are not separable in ESM-2 space (negative Silhouette means: knotted ≈ −0.151, fold-switching ≈ −0.108, IDP ≈ −0.057). The model conflates physically distinct topological/thermodynamic regimes when sequence statistics overlap.
7. A region-replacement control argues the conflation is intrinsic, not just mean-pooling “dilution.” Replacing annotated distinctive regions (disordered segments / knot regions / fold-switching interface units) with matched ASTRAL95 regions barely changes Silhouette scores (shifts ~0.0–0.6%), implying the limitation is not localized to a removable motif.
8. Density behavior in latent space inverts physical entropy: using KDE on UMAP-2D, IDPs (physically high conformational entropy) occupy the densest latent regions (IDP density ~1.36× ASTRAL95 baseline), interpreted as low evolutionary sequence entropy being compressed into tight manifold neighborhoods.
9. Mechanistic explanation via topology and “gauge” geometry: persistent homology separates random vs biological classes at a macroscopic level (large Wasserstein-2 distances), but holonomy-defect analysis shows class-invariant local curvature (“geometric turbulence”; tiny effect sizes η²), explaining why local neighborhoods fail to resolve fine topological/thermodynamic phases.
10. Structure-aware control: SaProt (sequence plus Foldseek 3Di tokens) partially reduces aliasing for static anomalies like knots (Silhouette knotted: −0.106 → 0.008; 8% → 56% positive), but still cannot separate alternative fold states in fold-switching proteins (conf1 vs conf2 Silhouette ≈ −0.002), suggesting static structural tokens help topology but not multi-state thermodynamic phase behavior.
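Most of the separability numbers above are per-class Silhouette means and "percent positive" fractions. As a rough illustration of how such numbers can be computed, here is a minimal NumPy sketch. It is not the authors' pipeline: the Euclidean metric, the function names `silhouette_per_sample` and `class_summary`, and the synthetic 5-D "embeddings" are all assumptions made for the example (the paper reports scores on 50D PCA of ESM-2 embeddings).

```python
import numpy as np

def silhouette_per_sample(X, labels):
    """Per-sample silhouette values with Euclidean distance.

    Assumes at least two distinct classes are present.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix; fine for small demos, O(n^2) memory.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    classes = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same < 2:
            continue  # singleton class: silhouette is conventionally 0
        # a: mean distance to the other members of the sample's own class
        # (the self-distance is 0, so divide by n_same - 1).
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to any other class.
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

def class_summary(X, labels):
    """Map each class to (mean silhouette, fraction of positive samples)."""
    labels = np.asarray(labels)
    s = silhouette_per_sample(X, labels)
    return {
        str(c): (float(s[labels == c].mean()), float((s[labels == c] > 0).mean()))
        for c in np.unique(labels)
    }

# Synthetic stand-in data (made up for illustration, not from the paper):
# "knotted" is drawn from the same distribution as "baseline", while
# "random" is shifted far away in every dimension.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, (60, 5)),   # baseline
    rng.normal(0.0, 1.0, (15, 5)),   # knotted, overlaps baseline
    rng.normal(8.0, 1.0, (30, 5)),   # random, isolated cluster
])
labels = np.array(["baseline"] * 60 + ["knotted"] * 15 + ["random"] * 30)

summary = class_summary(X, labels)
for name, (mean_s, frac_pos) in summary.items():
    print(f"{name}: mean silhouette = {mean_s:.3f}, positive = {frac_pos:.0%}")
```

In this toy setup the isolated class scores high while the overlapping class hovers near zero, which mirrors the qualitative pattern in the thread: a mean near or below zero means a class sits no closer to its own members than to its nearest neighboring class. For real embeddings, scikit-learn's `silhouette_samples` computes the same per-sample quantity without building the full distance tensor by hand.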
💻Code:
github.com/wyqmath/ESM-Laten…
📜Paper:
biorxiv.org/content/10.64898…
#ProteinLanguageModels #ESM2 #ComputationalBiology #ProteinFolding #IntrinsicDisorder #FoldSwitching #ProteinKnots #TopologicalDataAnalysis #RepresentationLearning #Biophysics