Few shot learning for phenotype-driven diagnosis of patients with rare genetic diseases
1. SHEPHERD is a new deep learning method that helps diagnose rare genetic diseases from only a few labeled examples, addressing the data scarcity that limits this field.
2. Unlike traditional diagnostic models that rely heavily on large datasets, SHEPHERD is trained primarily on 40,000 simulated patients spanning 2,000 rare diseases, allowing it to generalize to new and atypical cases.
3. SHEPHERD leverages a biomedical knowledge graph containing relationships between genes, phenotypes, and diseases, using graph neural networks to embed patient data into a structured latent space.
4. The model excels at three major tasks: causal gene discovery, finding “patients-like-me” with similar genotypic and phenotypic profiles, and characterizing novel disease presentations through interpretable embeddings.
5. When tested on 465 patients from the Undiagnosed Diseases Network (UDN) using expert-curated gene lists, SHEPHERD ranks the causal gene first in 40% of cases and within the top 5 in 85%.
6. SHEPHERD outperforms 12 benchmark methods, including LIRICAL, HiPhive, and LLaMA models, in gene prioritization across expert-curated and variant-filtered gene lists.
7. For patients with novel diseases or genes lacking known phenotype associations, SHEPHERD achieves up to 86% win rates in correctly prioritizing causal genes, outperforming all baselines in almost every subgroup.
8. The model also learns meaningful patient embeddings: patients with the same disease cluster together in the embedding space (adjusted mutual information, AMI = 0.304), allowing accurate retrieval of similar cases across cohorts.
9. SHEPHERD can retrieve “patients-like-me” from independent cohorts like MyGene2, outperforming Phrank in similarity search and reducing the number of patient comparisons needed by 17.2%.
10. Beyond prediction, SHEPHERD provides interpretable summaries of unknown syndromes by estimating similarity to known disease categories, offering actionable insights for clinicians investigating novel presentations.
11. Two UDN case studies demonstrate that SHEPHERD accurately prioritizes causal genes even when patients exhibit highly atypical symptoms not directly linked to known gene-disease associations.
12. The model supports flexible integration into the diagnostic workflow: it can assist after clinical workup, during variant review, or for downstream analysis when investigating new candidate genes.
13. SHEPHERD is trained in a disease-stratified manner to ensure generalization to unseen conditions, with no overlap between training and validation diseases.
14. Its use of synthetic patient data ensures privacy, enabling public model release without compromising real patient confidentiality.
15. While existing models often depend on direct gene-phenotype links, SHEPHERD captures indirect associations through multi-hop graph reasoning, which is critical for diagnosing poorly characterized or novel disorders.
16. The embedding attention mechanism offers partial interpretability, highlighting which phenotype features contributed most to the model’s predictions.
17. Limitations include reliance on the quality of the knowledge graph and underrepresentation of non-European populations in training data, pointing to opportunities for broader data inclusion and variant-level integration.
18. SHEPHERD showcases how few-shot, knowledge-guided deep learning can transform rare disease diagnosis, reducing expert burden and shortening diagnostic delays in real clinical settings.
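For readers who want to see what two of the ideas above mean in practice, here is a minimal Python sketch of a top-k hit rate (the kind of "top 1" and "top 5" figure quoted for gene ranking) and of a disease-stratified split in which no disease appears in both training and validation. This is not SHEPHERD's code: the function names, the toy gene and disease labels, and the 20% validation fraction are all assumptions made up for illustration.

```python
import random


def top_k_hit_rate(ranked_genes, causal_gene, k):
    """Fraction of patients whose causal gene is among the first k ranked candidates."""
    hits = sum(
        1 for patient_id, ranked in ranked_genes.items()
        if causal_gene[patient_id] in ranked[:k]
    )
    return hits / len(ranked_genes)


def disease_stratified_split(patient_to_disease, val_fraction=0.2, seed=0):
    """Split patients so that every disease lands entirely in train or entirely in validation."""
    diseases = sorted(set(patient_to_disease.values()))
    random.Random(seed).shuffle(diseases)
    n_val = max(1, int(len(diseases) * val_fraction))
    val_diseases = set(diseases[:n_val])
    train, val = [], []
    for patient_id, disease in sorted(patient_to_disease.items()):
        (val if disease in val_diseases else train).append(patient_id)
    return train, val


# Hypothetical toy data: each patient has a ranked candidate gene list and one causal gene.
rankings = {
    "p1": ["GENE_A", "GENE_B", "GENE_C"],
    "p2": ["GENE_D", "GENE_E", "GENE_F"],
    "p3": ["GENE_G", "GENE_H", "GENE_I"],
    "p4": ["GENE_J", "GENE_K", "GENE_L"],
}
causal = {"p1": "GENE_A", "p2": "GENE_E", "p3": "GENE_I", "p4": "GENE_Z"}

top1 = top_k_hit_rate(rankings, causal, k=1)  # only p1 is ranked first: 0.25
top3 = top_k_hit_rate(rankings, causal, k=3)  # p1, p2, p3 are within the top 3: 0.75

# Hypothetical patient-to-disease labels for the split.
patient_to_disease = {
    "p1": "d1", "p2": "d1", "p3": "d2", "p4": "d3", "p5": "d4", "p6": "d5",
}
train_ids, val_ids = disease_stratified_split(patient_to_disease)
```

Splitting by disease rather than by patient is what makes the validation score a test of generalization to unseen conditions: a per-patient random split could place two patients with the same disease on opposite sides.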
💻Code:
huggingface.co/spaces/emilya…
📜Paper:
nature.com/articles/s41746-0…
#RareDisease #Genomics #DeepLearning #FewShotLearning #BiomedicalAI #GraphNeuralNetworks #PrecisionMedicine