Embed-Search-Align: DNA Sequence Alignment using Transformer Models
1. Introducing Embed-Search-Align (ESA), a novel framework that uses a Transformer-based Reference-Free DNA Embedding (RDE) model to align DNA sequences efficiently, with accuracy rivaling traditional aligners such as Bowtie and BWA-Mem.
2. Key innovation: ESA transforms genome-wide sequence alignment into a vector search task, enabling efficient identification of top-matching fragments through a specialized DNA vector store.
3. RDE achieves 99% accuracy in aligning 250-length reads to a human reference genome, significantly outperforming six recent DNA-Transformer baselines such as Hyena-DNA and DNABERT-2 in both precision and scalability.
4. Unique features: Self-supervised training with contrastive loss allows RDE to generate rich embeddings, preserving sequence locality in the embedding space and enabling robust cross-species and cross-chromosome alignment.
5. ESA reduces computational complexity, aligning 10,000 reads per minute while maintaining high accuracy, a step forward for read alignment on large and complex genomes.
6. Real-world implications: ESA’s superior performance in aligning short reads from simulated and experimental datasets offers transformative potential for genomics, including variant calling, transcriptomics, and epigenomics.
7. Looking ahead: ESA’s framework paves the way for advanced applications like pan-genome alignment and de novo genome assembly, with promising initial results on species like Thermus aquaticus.
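
For readers who want a feel for the embed, search, align idea from point 2, here is a minimal Python sketch. It is not the paper's implementation, and every name in it is hypothetical. `toy_embed` uses normalized k-mer counts as a stand-in for the trained RDE Transformer. A brute-force NumPy dot product stands in for the DNA vector store. `refine` is a simple mismatch scan inside the retrieved fragments, swapped in for a proper local aligner. The fragment length, stride and k-mer size are illustrative choices, not values from the paper.

```python
import itertools
import random

import numpy as np

K = 5  # k-mer size for the toy embedding (an assumption, not from the paper)
KMER_INDEX = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=K))}


def toy_embed(seq):
    """Hypothetical stand-in for the RDE encoder: unit-length k-mer count vector."""
    vec = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:  # skip k-mers with symbols other than A/C/G/T
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_store(reference, frag_len, stride):
    """Embed overlapping reference fragments; returns (start offsets, vector matrix)."""
    starts = list(range(0, len(reference) - frag_len + 1, stride))
    vectors = np.stack([toy_embed(reference[s:s + frag_len]) for s in starts])
    return starts, vectors


def search(read, starts, vectors, top_k=3):
    """Search step: rank fragments by cosine similarity to the read embedding."""
    scores = vectors @ toy_embed(read)  # rows are unit vectors, so this is cosine
    best = np.argsort(-scores)[:top_k]
    return [(starts[i], float(scores[i])) for i in best]


def refine(read, reference, frag_start, frag_len):
    """Align step (simplified): offset inside the fragment with the fewest mismatches."""
    end = min(frag_start + frag_len, len(reference))
    best_pos, best_mm = None, len(read) + 1
    for pos in range(frag_start, end - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, reference[pos:pos + len(read)]))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def align(read, reference, starts, vectors, frag_len, top_k=3):
    """Embed the read, retrieve top_k fragments, refine within each, keep the best."""
    candidates = search(read, starts, vectors, top_k)
    results = [refine(read, reference, s, frag_len) for s, _ in candidates]
    return min(results, key=lambda r: r[1])  # (position, mismatch count)


# Toy usage: a random 2,000-base "genome" and a 40-base read cut from offset 730.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
starts, vectors = build_store(reference, frag_len=100, stride=50)
read = reference[730:770]
print(align(read, reference, starts, vectors, frag_len=100))
```

The point of the sketch is the shape of the pipeline: only the handful of fragments returned by the vector search are inspected base by base, so the expensive comparison never touches the whole reference. Swapping `toy_embed` for a learned encoder and the dot product for an approximate nearest-neighbor index would keep the same structure.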
@LajoyceMboning @KCEnevoldsen
📜Paper:
arxiv.org/abs/2309.11087
#Genomics #DNAAlignment #Transformers #MachineLearning #Bioinformatics #SequenceAnalysis #GenomeAssembly