Large-scale data-driven pre-trained DNA models enhance performance across diverse genomics tasks
1. The paper introduces SUCCEED, a supervised multi-task DNA foundation model pretrained on 6,389 ENCODE functional genomics tracks, aiming to learn transferable regulatory representations that can be adapted across many downstream genomics tasks with minimal retraining.
2. SUCCEED’s core design is a lightweight hybrid CNN–Transformer: convolutional layers learn local motif features, while a Transformer encoder models long-range regulatory dependencies; several LLM-inspired upgrades are included (SwiGLU, RMSNorm, RoPE, and grouped-query attention) to improve stability and efficiency.
3. In a DNA-only benchmark against Enformer, SUCCEED achieves comparable or better performance despite its reduced architectural complexity: it improves CAGE prediction (Pearson correlation coefficient, PCC, 0.76 vs 0.703), performs similarly on histone ChIP-seq, falls slightly below on TF ChIP-seq, and comes close on DNase/ATAC.
4. On standard short-sequence genomic benchmarks (promoter and splice-site tasks), fine-tuning the pretrained SUCCEED yields a higher mean accuracy than training from scratch (0.906 vs 0.891) and is competitive with (or better than) large self-supervised DNA language models on most tasks.
5. Interpretability analyses indicate SUCCEED learns biologically meaningful features: first-layer filters recover known TF motifs (via TOMTOM/JASPAR matching), and Input×Gradient attributions suggest predictions rely on both local motifs and distal sequence context.
6. The work emphasizes multi-scale transfer: models trained at 131 kb inputs can be fine-tuned to longer contexts (e.g., 524 kb, 1 Mb, 2 Mb) and different resolutions with strong performance; updating only the prediction head (or head Transformer) can outperform training from scratch while reducing compute and accelerating convergence.
7. For unseen cell types, SUCCEED is tested on scATAC-seq-derived pseudo-bulk profiles from 45 human brain cell types; fine-tuning is computationally cheaper and can match (or sometimes exceed) de novo training, suggesting the pretrained model captures broadly reusable regulatory “grammar”.
8. To predict cell-type-specific epigenomic profiles, SUCCEED is extended with an ATAC-seq encoder to inject cell-state information; in cross-chromosomal and cross-cell-type evaluations, it generally outperforms EPCOT, with especially strong gains for histone mark prediction and competitive TF-binding prediction.
9. SUCCEED is also used as a prior for denoising and enhancing chromatin accessibility signal: it outperforms AtacWorks on bulk ATAC-seq and scATAC-seq, remains robust under extremely low coverage (e.g., 0.2M reads), and can reconstruct accessibility from very small cell counts (the paper reports that input from a single cell approaches the performance that conventionally requires far more cells).
10. For 3D genome modeling, SUCCEED improves training stability and can predict cell-type-specific Hi-C contact patterns; notably, it can reconstruct 3D architecture without requiring CTCF ChIP-seq input, and it remains effective when driven by sparse scATAC-seq (including small numbers of cells), supporting scalable 3D inference where Hi-C/CTCF data are unavailable.
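For readers unfamiliar with two of the LLM-inspired components named in point 2, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward block, following their standard textbook definitions. It is not taken from the SUCCEED code: the function names, the list-of-lists weight layout, and the toy dimensions are all assumptions made for illustration, and the paper's implementation may differ in details such as bias terms or hidden sizes.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a per-feature gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)


# Toy usage with hypothetical sizes: model width 2, hidden width 3.
x = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w_up = [[0.2, 0.1], [-0.3, 0.5], [0.6, -0.1]]
w_down = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.5]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

The gating in SwiGLU lets one projection of the input modulate another, and RMSNorm skips the mean-centering step of LayerNorm. Both choices are common in recent language models, in line with the stability and efficiency motivation given in point 2.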
📜Paper:
doi.org/10.1038/s41467-026-7…
#ComputationalBiology #Genomics #DeepLearning #FoundationModels #ENCODE #Epigenomics #ATACseq #HiC #3DGenome #TransferLearning