Understanding Protein Language Model Scaling on Mutation Effect Prediction
1. This study investigates why larger protein language models (pLMs), such as ESM2-3B and ESM2-15B, often underperform compared to medium-sized models like ESM2-650M in predicting mutation effects, despite achieving lower perplexity.
2. The authors show that mutation effect prediction performance peaks when the model's perplexity on a protein falls between 3 and 6. For proteins outside this range, models tend to output indiscriminate log-likelihood ratios (LLRs) and fail to distinguish deleterious from neutral mutations.
3. At low perplexity (typical for large models), LLRs are overly negative for most mutations, collapsing dynamic range. At high perplexity (typical for small models), LLRs cluster near zero. Both extremes diminish predictive resolution.
4. The study uses two large benchmarks, ProteinGym and a mega-scale protein stability dataset, to show that model performance follows a rise-then-fall trend with respect to perplexity. The trend holds across proteins with diverse sequence homology and structural contexts.
5. The authors link pLM-predicted amino acid distributions with conservation profiles from MSAs, finding highest agreement (lowest KL divergence) in the 3–6 perplexity range, where pLMs implicitly recapitulate evolutionary conservation.
6. Residue-level analyses reveal that low-perplexity models lose specificity for contextually important positions, whereas optimal-perplexity models better capture both average residue fitness and substitution-specific effects.
7. Practical guidance is offered: if a model yields very low perplexity on a protein, switch to a smaller model; if perplexity is too high, fine-tune the model using homologous sequences to bring it into the optimal zone.
8. The study cautions that pLMs are not trained to model conservation but to predict wild-type residues, which can cause large models to overfit to dominant sequences, leading to poor mutation differentiation in highly conserved proteins.
9. Proteins outside natural evolutionary constraints—like viral or designed proteins—show lower performance even at ideal perplexity, indicating that pLMs may not generalize to synthetic biology without retraining on function-specific objectives.
10. The authors propose future training regimes that dynamically exclude overly predictable proteins during pLM training to optimize for mutation effect prediction, and advocate reconsidering perplexity as a universal benchmark for model quality.
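To make points 2, 3 and 7 concrete, here is a minimal Python sketch of the quantities involved. It is an illustration under stated assumptions, not the authors' code: it assumes you already have, for each sequence position, a row of log-probabilities over the 20 standard amino acids from some pLM (how you obtain them, e.g. by masking each position, is up to you). The function names, the alphabet ordering, and the exact perplexity definition (exponential of the mean negative log-probability of the wild-type residue) are assumptions made for this example; only the 3–6 range and the smaller-model / fine-tune advice come from the summary above.

```python
import math

# Assumed column order for each row of log-probabilities (hypothetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_perplexity(log_probs, sequence):
    """exp(mean negative log-probability of the wild-type residue).

    log_probs[i][j] is the model's log-probability of AMINO_ACIDS[j]
    at position i; sequence is the wild-type sequence.
    """
    total_nll = 0.0
    for i, wt in enumerate(sequence):
        total_nll -= log_probs[i][AMINO_ACIDS.index(wt)]
    return math.exp(total_nll / len(sequence))


def llr(log_probs, sequence, position, mutant):
    """Log-likelihood ratio: log p(mutant) - log p(wild type) at one position."""
    row = log_probs[position]
    wt = sequence[position]
    return row[AMINO_ACIDS.index(mutant)] - row[AMINO_ACIDS.index(wt)]


def recommend(perplexity, low=3.0, high=6.0):
    """Apply the thread's rule of thumb to one protein's perplexity."""
    if perplexity < low:
        return "try a smaller model"
    if perplexity > high:
        return "fine-tune on homologs"
    return "in range"
```

With a uniform distribution at every position, the perplexity is 20 and every LLR is exactly zero, which is the "LLRs cluster near zero" regime from point 3. As the model grows more confident in the wild-type residue, perplexity falls toward 1 and every substitution's LLR becomes strongly negative.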
📜Paper:
biorxiv.org/content/10.1101/…
#ProteinLM #MutationEffect #Perplexity #ProteinEngineering #ComputationalBiology #ESM2 #ModelScaling #AI4Science #Bioinformatics