Learning Human T Cell Behaviors through Generative AI Embeddings of T Cell Receptors
1. This paper introduces Tarpon, a generative AI model trained on over one million human TCR sequences, that creates interpretable embeddings capturing T cell behaviors across development and disease.
2. Tarpon embeddings can stratify CD4 and CD8 T cell fates across fetal and adult stages using only CDR3 sequences, independent of V/J gene identity, revealing physicochemical signatures that track developmental trajectories.
3. Using Tarpon, the authors developed AgFlow, a flow-based generative model that learns antigen-specific TCR distributions and can generate de novo, physicochemically realistic TCRs specific to antigens like SARS-CoV-2 YLQPRTFLL.
4. AgFlow generated YLQ-specific TCRs that matched experimentally validated activation profiles and even reproduced dominant public TCRs absent from training data, demonstrating potential for in silico TCR discovery.
5. Tarpon’s embeddings correlate with physicochemical properties, enabling interpretable latent dimensions; TCRs cluster by hydrophobicity, aromaticity, and charge, supporting downstream biological insights and mining efforts.
6. Embeddings also reflect antigen specificity: different viral antigens (e.g., CMV, IAV) occupy distinct, learnable regions in latent space, with shared features across immunogenic TCRs suggesting a conserved “immunogenicity signature.”
7. Cross-dataset mapping revealed that CD4/CD8 TCR distinctions learned in fetal thymus generalized to adult cancer and COVID-19 cohorts; fetal-trained models even outperformed adult-trained ones on fetal data.
8. Atypical adult CD8 T cells with “CD4-like” TCRs were associated with severe COVID-19 and higher inflammation markers, suggesting that deviations from fetal-like TCR rules may relate to disease susceptibility.
9. Tarpon uncovered hidden heterogeneity within fetal type I innate T cells, mapping them to adult MAIT and KIR T cell subsets. Transcriptomic analyses confirmed this dual identity, highlighting new developmental links.
10. The model outperforms traditional CDR3-based distance metrics like TCRdist in identifying biologically meaningful distinctions and is computationally efficient (100M embeddings in under 20 minutes on a single GPU).
11. By generating de novo, antigen-specific TCRs and offering interpretable latent features, Tarpon provides a scalable, open framework for TCR repertoire analysis, synthetic TCR design, and cross-cohort immune profiling.
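For readers who want a feel for the physicochemical properties mentioned in point 5 (hydrophobicity, aromaticity, charge), here is a minimal Python sketch that computes simple per-sequence descriptors from a CDR3 amino acid string. This is not Tarpon's code or method: the function name `cdr3_descriptors` is hypothetical, hydrophobicity is assumed to be the mean Kyte-Doolittle hydropathy, aromaticity the fraction of F/W/Y residues, and charge a crude count of K/R minus D/E that ignores histidine and pH.

```python
# Illustrative only: simple physicochemical descriptors for a CDR3 sequence.
# Assumptions (not from the paper): Kyte-Doolittle hydropathy scale,
# aromatic residues = F/W/Y, net charge = (#K + #R) - (#D + #E).

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FWY")
POSITIVE = set("KR")
NEGATIVE = set("DE")


def cdr3_descriptors(seq: str) -> dict:
    """Return mean hydropathy, aromatic fraction, and net charge for a CDR3."""
    seq = seq.strip().upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or non-standard residues: {sorted(unknown)}")
    n = len(seq)
    return {
        "hydrophobicity": sum(KYTE_DOOLITTLE[aa] for aa in seq) / n,
        "aromaticity": sum(aa in AROMATIC for aa in seq) / n,
        "net_charge": sum(aa in POSITIVE for aa in seq)
        - sum(aa in NEGATIVE for aa in seq),
    }


if __name__ == "__main__":
    # Illustrative CDR3-like sequence, chosen only to exercise the function.
    print(cdr3_descriptors("CASSIRSSYEQYF"))
```

Descriptors like these could be correlated against individual embedding dimensions to check whether a latent axis tracks, say, hydrophobicity; how the authors actually performed that analysis is described in the paper, not here.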
💻Code:
github.com/danielgchen/Tarpo…
📜Paper:
biorxiv.org/content/10.1101/…
#Immunology #TCR #GenAI #DeepLearning #SingleCell #SyntheticBiology #CD4 #CD8 #COVID19 #CancerImmunology