seqLens: Optimizing Language Models for Genomic Predictions
1. The paper introduces seqLens, a family of DNA language models designed for genomic predictions, leveraging disentangled attention with relative positional encoding to enhance performance across multiple genomic tasks.
2. Unlike conventional DNA language models that rely on fixed k-mer tokenization, seqLens employs byte-pair encoding (BPE), allowing dynamic token lengths that improve convergence and reduce computational overhead.
3. The study systematically compares seqLens models to existing state-of-the-art DNA models, demonstrating superior performance in 13 of 19 benchmarking tasks, including phenotype prediction and genome annotation.
4. Two large-scale pretraining datasets were used: one containing 19,551 reference genomes (predominantly bacterial) and another with a broader taxonomic balance, including eukaryotic and archaeal genomes, covering 180 billion nucleotides.
5. seqLens models integrate multiple architectural improvements, including the DeBERTa-inspired disentangled attention mechanism, which separately encodes content and positional information, leading to more efficient DNA sequence modeling.
6. The study explores various fine-tuning strategies, including full fine-tuning and parameter-efficient adaptation methods like LoRA, revealing trade-offs between computational efficiency and predictive accuracy.
7. Domain adaptation through continual pretraining significantly enhances performance on specialized genomic tasks, enabling seqLens models to effectively transfer learned representations across different genomic datasets.
8. Benchmarking results indicate that seqLens models with smaller vocabularies tend to generalize better than those with larger vocabularies, highlighting the importance of balancing vocabulary size and training efficiency.
9. Alternative pooling strategies, including mean and max pooling, are evaluated for classification tasks, with mean pooling outperforming the commonly used CLS token representation in multiple genomic benchmarks.
10. Future work aims to refine seqLens by incorporating multimodal genomic data, exploring reinforcement learning-based optimization, and expanding applications in metagenomics and regulatory sequence modeling.
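To make point 9 concrete, here is a minimal sketch of the three pooling strategies it compares. It is not the paper's code: the function names (`cls_pool`, `mean_pool`, `max_pool`), the toy embeddings, and the assumption that the CLS token sits at position 0 are all illustrative choices.

```python
from typing import List

Vector = List[float]


def _unmasked(hidden: List[Vector], mask: List[int]) -> List[Vector]:
    """Keep only token embeddings whose attention-mask entry is 1 (drops padding)."""
    kept = [h for h, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask excludes every token")
    return kept


def cls_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Use the first token's embedding (assumed here to be the CLS position)."""
    return list(hidden[0])


def mean_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Average each embedding dimension over the non-padding tokens."""
    kept = _unmasked(hidden, mask)
    return [sum(col) / len(kept) for col in zip(*kept)]


def max_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Take the per-dimension maximum over the non-padding tokens."""
    kept = _unmasked(hidden, mask)
    return [max(col) for col in zip(*kept)]


# Toy sequence: 4 tokens with 2-dimensional embeddings; the last token is padding.
hidden_states = [[1.0, 0.0], [3.0, 4.0], [5.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pool(hidden_states, attention_mask))   # [1.0, 0.0]
print(mean_pool(hidden_states, attention_mask))  # [3.0, 2.0]
print(max_pool(hidden_states, attention_mask))   # [5.0, 4.0]
```

In a real pipeline the pooled vector would feed a classification head. CLS pooling relies on a single position, while mean and max pooling summarize every non-padding token.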
Paper:
biorxiv.org/content/10.1101/…
#GenomicAI #LanguageModels #Bioinformatics #ComputationalBiology #MachineLearning