Variant Classification Using Proteomics-Informed Large Language Models Increases Power of Rare Variant Association Studies and Enhances Target Discovery
1. This study introduces a proteomics-informed refinement of large language models (LLMs) to improve the classification of rare missense variants and enhance the power of rare variant association studies in human genetics.
2. The authors use plasma proteomics data from 46,665 individuals in the UK Biobank to correlate protein abundance changes with coding variants, finding strong associations between predicted deleteriousness and proteomic readouts.
3. The model was built in two steps: first, an ensemble classifier trained on synonymous and predicted loss-of-function (pLoF) variants labeled rare missense variants; second, those labels were used to fine-tune the ESM-1b LLM, producing the ESM-1b proteomics model.
4. ESM-1b proteomics achieved higher correlation with validation proteomic assay results than standard LLMs (ESM-1b, ESM-1v, AlphaMissense), particularly in within-gene analyses of missense variant impact.
5. When benchmarked on 241 gene-trait pairs with known pLoF associations, ESM-1b proteomics recapitulated 88 associations using only singleton missense variants, outperforming all tested methods, including AlphaMissense (87) and ESM-1b (83).
6. Applied to 10 complex traits in the UK Biobank, the model yielded 177 gene-trait associations at genome-wide significance, a 24.6% increase over conventional ensemble methods and a 15.7% improvement over ESM-1b.
7. Novel associations identified by the model include PCSK6 with triglyceride levels and SIX1 with hearing loss. Conventional variant classifiers missed both, highlighting the model's value in target discovery.
8. ESM-1b proteomics also outperformed standard ESM-1b in classifying ClinVar variants, achieving an AUROC of 0.940 vs 0.919. AlphaMissense had a slightly higher AUROC (0.947), but ESM-1b proteomics performed better in association studies.
9. The approach generalized to other LLMs, including ESM-1v and ESM-2 models, with proteomic fine-tuning notably improving performance, especially in smaller models.
10. This work establishes that large-scale human proteomics can be a powerful, unbiased supervisory signal to refine LLMs for variant interpretation, improving both discovery yield and mechanistic insight in genetics.
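For readers less familiar with the AUROC figures in point 8, here is a minimal sketch of how that metric can be computed from variant scores and pathogenic/benign labels. The function name `auroc` and the toy scores and labels are hypothetical and chosen only for illustration; this is not the paper's evaluation code and the numbers are not ClinVar data.

```python
def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive
    (label 1) scores higher than a randomly chosen negative (label 0).
    Ties count as half a win. O(n_pos * n_neg); fine for small examples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: higher score = predicted more deleterious,
# label 1 = pathogenic, label 0 = benign.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)  # 8 of 9 pairs ranked correctly
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the gap between 0.919 and 0.940 reflects a modest gain in how often a pathogenic variant is ranked above a benign one.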
📜Paper:
biorxiv.org/content/10.1101/…
#Genomics #Proteomics #VariantClassification #LLM #RareVariants #FunctionalGenomics #ProteinLanguageModels #HumanGenetics #UKBiobank #Bioinformatics #PrecisionMedicine #TargetDiscovery