When a protein embedding is indistinguishable from noise
Protein language models have become the backbone of computational biology. Feed them an amino acid sequence, and they return a dense vector—a compact numerical fingerprint that downstream models use to predict function, structure, localization, or the effect of a mutation. The assumption, largely unquestioned, is that this fingerprint actually encodes meaningful biology.
Prabakaran and Bromberg challenge that assumption directly. They ask a deceptively simple question: how do you know whether a given embedding actually represents a protein—or whether it's just noise dressed up as a vector?
Their answer is the Random Neighbor Score (RNS). The idea is elegant: generate biologically meaningless sequences by randomly shuffling the residues of real proteins—preserving amino acid composition but destroying all evolutionarily meaningful interactions. Then, for each real protein, measure how many of its nearest neighbors in latent space are these random imposters. A high RNS means the model never learned to place that protein somewhere biologically meaningful.
Applied to ESM-2 and ProtT5 across thousands of proteins, RNS correlates strongly with structural prediction quality: proteins with poorly predicted structures have embeddings nearly indistinguishable from random sequences. Downstream tasks follow the same pattern—contact prediction precision drops roughly 40% for high-RNS proteins, and variant effect prediction falls to near chance. Most sobering: between 19% and 46% of the human proteome is underlearned by current models, depending on architecture. Intrinsically disordered regions fare especially poorly across all architectures tested.
RNS is model-agnostic and computationally cheap—around two minutes on GPU for 10,000 proteins—making it a practical prescreening step before any embedding-based inference.
For R&D teams that routinely use protein embeddings to prioritize variants, annotate novel sequences, or screen large libraries, this has immediate consequences. Running RNS before downstream inference flags proteins where predictions are unreliable, reducing the risk of propagating errors into expensive wet-lab campaigns. It also offers a principled way to identify gaps in model coverage—directly actionable for teams building or fine-tuning their own foundation models.
Paper: R. Prabakaran & Yana Bromberg, Nature Methods (2026) — CC BY-NC-ND 4.0 |
nature.com/articles/s41592-0…