Scaling down protein language modeling with MSA Pairformer
1. The paper introduces MSA Pairformer, a novel protein language model that achieves state-of-the-art performance with significantly fewer parameters than existing models. MSA Pairformer uses a memory-efficient architecture to process multiple sequence alignments (MSAs) and extract evolutionary signals relevant to a query sequence.
2. A key innovation of MSA Pairformer is its query-biased outer product operation, which selectively weights sequences based on their evolutionary relevance to the query sequence. This allows the model to capture subfamily-specific co-evolutionary signals within large protein families, addressing a limitation of previous MSA models.
3. MSA Pairformer demonstrates superior performance in unsupervised contact prediction, outperforming models like ESM2-15B by 6 percentage points while using two orders of magnitude fewer parameters. It also shows substantial improvements in predicting contacts at protein-protein interfaces, with a 24-percentage-point increase over MSA Transformer.
4. Unlike single-sequence models, which struggle with variant effect prediction as they scale, MSA Pairformer maintains strong performance in both contact prediction and zero-shot variant effect prediction. This highlights its ability to balance evolutionary signal extraction and functional prediction.
5. Ablation studies reveal that triangle operations in MSA Pairformer help remove indirect correlations between residues, enabling more accurate contact predictions. Additionally, MSA Pairformer does not hallucinate contacts after removing covariance from MSAs, unlike MSA Transformer.
6. The model’s ability to extract subfamily-specific signals and its robustness to MSA perturbations open new avenues for biological discovery, including the potential to explore alternative conformations and interactions within protein families.
7. MSA Pairformer challenges the current scaling paradigm in protein language modeling by demonstrating that parameter efficiency and biological insight can synergistically advance the field. It enables efficient adaptation to rapidly expanding sequence databases and paves the way for more sustainable and scalable protein language models.
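To make point 2 concrete, here is a minimal NumPy sketch of a query-weighted outer product over an MSA embedding. It is not the authors' implementation: the function name `query_biased_outer_product`, the mean-pooling step, the `temperature` parameter, and the softmax-over-dot-product weighting are all assumptions chosen for illustration. The only idea taken from the thread is that sequences are weighted by their relevance to the query before their outer products are summed into a pair representation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def query_biased_outer_product(msa_emb, temperature=1.0):
    """Hypothetical sketch of a query-weighted outer product.

    msa_emb: array of shape (S, L, C) with S sequences, L positions,
             C channels. Row 0 is assumed to be the query sequence.
    Returns (pair, weights):
      pair    has shape (L, L, C, C)
      weights has shape (S,) and sums to 1
    """
    n_seq, n_pos, n_chan = msa_emb.shape

    # Assumption: summarize each sequence by mean-pooling over positions.
    pooled = msa_emb.mean(axis=1)                      # (S, C)

    # Assumption: relevance = scaled dot product with the pooled query.
    scores = pooled @ pooled[0] / (np.sqrt(n_chan) * temperature)
    weights = softmax(scores)                          # (S,)

    # Weighted sum over sequences of per-sequence outer products
    # between the embeddings at positions i and j.
    pair = np.einsum("s,sic,sjd->ijcd", weights, msa_emb, msa_emb)
    return pair, weights


# Usage on random data (8 sequences, 5 positions, 4 channels).
rng = np.random.default_rng(0)
msa = rng.normal(size=(8, 5, 4))
pair, weights = query_biased_outer_product(msa)
print(pair.shape, weights.shape)
```

With uniform weights this reduces to the plain mean of outer products over the alignment; the weighting is what lets sequences close to the query dominate the pair signal. See the paper and code linked below for the actual weighting scheme.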
@yoakiyama @ZhidianZ @sokrypton
📜Paper:
biorxiv.org/content/10.1101/…
💻Code:
github.com/yoakiyama/MSA_Pai…
#ProteinLanguageModeling #MSAPairformer #Bioinformatics #ProteinStructurePrediction #MachineLearning