Coalescence and Translation: A Language Model for Population Genetics
1. This study introduces cxt, a decoder-only transformer that reframes coalescent time inference in population genetics as a sequence modeling problem, analogous to language translation. It translates local mutation patterns into coalescent times, enabling rapid and flexible inference of evolutionary history from genetic data.
2. Unlike traditional SMC-based methods, cxt does not rely on handcrafted likelihoods. Instead, it is trained on simulated data to predict the time to the most recent common ancestor (TMRCA) autoregressively, using a next-coalescence prediction objective.
3. cxt matches or outperforms both Singer (ARG-based) and Gamma-SMC (SMC-based) across a range of demographic scenarios. It is particularly effective in scenarios that depart from a single constant-size population, such as sawtooth and island models.
4. The model is highly efficient: cxt can infer over a million pairwise coalescence times in under five minutes on a single NVIDIA A100 GPU, allowing genome-wide analysis at population scale.
5. A key innovation is its ability to generate well-calibrated approximate posterior distributions over TMRCAs. This enables principled uncertainty quantification, a feature often lacking in other deep learning approaches.
6. cxt generalizes well out-of-distribution, as shown by its performance on unseen species and demographic models in stdpopsim v0.3. Fine-tuning further improves accuracy in novel settings, demonstrating its potential as a flexible foundation model.
7. The model’s broad training across stdpopsim v0.2 species allows it to act like a context-sensitive database: it returns different coalescence distributions depending on the input, without retraining.
8. cxt’s posterior predictions were validated by checking empirical coverage and variance under varying mutation rates. Results show consistent calibration and sensitivity to mutational density, even in challenging inference windows.
9. The method was applied to 1000 Genomes Project data and identified known genomic regions with atypical coalescence histories (e.g., a recent sweep at the LCT locus and deep coalescent times in the HLA region), demonstrating its utility on real-world data.
10. Architecturally, cxt modifies GPT-2 to handle continuous mutational input projected into latent space. Coalescent times are discretized and learned autoregressively, with rotary positional embeddings encoding local genomic distance.
11. The authors introduce the "next-coalescence prediction" task, analogous to next-token prediction in language models but tailored to sequential TMRCA estimation. This formulation enables structured learning of genealogical processes.
12. The approach is scalable, generalizable, and conceptually distinct from prior deep learning-based tools like CoalNN or ReLERNN. By modeling the generative process directly, cxt can capture richer evolutionary patterns without relying on predefined summary statistics.
13. The authors view cxt as a foundation model for coalescent inference, with future directions including modeling selection, multi-lineage coalescence, and extending to larger sample sizes or new species via fine-tuning.
14. While the current implementation uses fixed 2 kb windows, the authors discuss tradeoffs among resolution, model complexity, and memory efficiency, and suggest avenues for optimization such as adaptive window sizes and local attention mechanisms.
15. Crucially, cxt requires only mutation rates for calibration, bypassing the need for recombination rate inputs. This simplifies deployment while maintaining accuracy and adaptability across species and parameter regimes.
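To make points 5, 10, and 14 more concrete, here is a minimal sketch in plain Python of the kind of pre- and post-processing they describe: counting mutations in fixed windows, mapping a continuous coalescent time to a discrete token, and reading a credible interval off a categorical distribution over time bins. This is not code from the cxt repository. The function names, the log-spaced bin layout, and the example values are assumptions made for illustration; only the 2 kb window size comes from the thread.

```python
import math
from bisect import bisect_right

WINDOW_BP = 2_000  # window size mentioned in the thread; all else is assumed


def window_counts(positions, seq_len, window_bp=WINDOW_BP):
    """Count pairwise differences (mutations) in each fixed-size window."""
    counts = [0] * math.ceil(seq_len / window_bp)
    for pos in positions:
        if not 0 <= pos < seq_len:
            raise ValueError(f"position {pos} outside [0, {seq_len})")
        counts[pos // window_bp] += 1
    return counts


def log_bin_edges(t_min, t_max, n_bins):
    """Return n_bins + 1 log-spaced edges (hypothetical discretization)."""
    step = (math.log(t_max) - math.log(t_min)) / n_bins
    return [math.exp(math.log(t_min) + i * step) for i in range(n_bins + 1)]


def time_to_token(t, edges):
    """Map a continuous time to a bin index, clamping out-of-range values."""
    idx = bisect_right(edges, t) - 1
    return min(max(idx, 0), len(edges) - 2)


def credible_interval(probs, edges, level=0.95):
    """Equal-tailed interval at bin resolution from per-bin probabilities.

    Returns the lower edge of the bin holding the lower quantile and the
    upper edge of the bin holding the upper quantile, so the interval is
    slightly conservative.
    """
    lo_q = (1.0 - level) / 2.0
    hi_q = 1.0 - lo_q
    lo, hi = edges[0], edges[-1]
    found_lo = False
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if not found_lo and cum >= lo_q:
            lo, found_lo = edges[i], True
        if cum >= hi_q:
            hi = edges[i + 1]
            break
    return lo, hi
```

In this sketch, calibration in the sense of point 8 would correspond to checking how often the simulated true TMRCA falls inside `credible_interval(...)` across many windows and comparing that fraction with `level`.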
💻Code:
github.com/kr-colab/cxt
📜Paper:
biorxiv.org/content/10.1101/…
#PopulationGenetics #CoalescentTheory #LanguageModels #Genomics #DeepLearning #TMRCA #ARG #TransformerModel #SimulationBasedInference