Improving spliced alignment by modeling splice sites with deep learning
1. The paper introduces minisplice, a compact deep learning model built on a 1D CNN that predicts splice site probabilities across genomes and improves spliced alignment accuracy for both mRNA and protein sequences.
2. A key innovation is integrating splice site scores directly into popular aligners like minimap2 and miniprot, allowing them to better resolve ambiguous alignments around introns, especially for long noisy reads and proteins from distantly related species.
3. Where conventional aligners rely on simple motifs like GT..AG or on PWM models, minisplice uses a 7,026-parameter CNN trained across vertebrate and insect genomes, capturing both conserved splice signals and clade-specific features such as GC-rich introns in mammals and birds.
4. The model takes a 202 bp sequence window around each candidate GT or AG site and converts raw neural network scores into empirical probabilities using known annotations, enabling probabilistic scoring compatible with alignment algorithms.
5. Extensive evaluation shows that using minisplice scores reduces unannotated (likely incorrect) junctions in protein-to-genome alignments from 14% to 4.4% (zebrafish to human) and from 20.7% to 5.6% (mosquito to fruit fly).
6. Performance improvements are consistent across sequence identity bins, with splice-aware scoring substantially reducing junction error rates even at low identity. For RNA-seq data, error rates dropped from 1.4% to 1.0%, with more pronounced gains on older or noisier reads.
7. Minisplice is implemented in C with minimal dependencies and outputs splice scores that can be reused by other tools. It doesn't replace the aligners; it adds sequence-context information to their scoring.
8. Cross-species generalization was tested with models trained on multiple insects and vertebrates. A joint vi2 model performed nearly as well as species-specific models, and significantly outperformed models trained on distant species when applied to new genomes.
9. Analysis of CNN activations and UMAP clustering revealed that the model captures both canonical splicing signals and broader compositional features of intronic and exonic regions. This includes species-specific elements like mammalian GC-rich introns.
10. Minisplice focuses only on GT..AG splice sites. This covers most introns, but it doesn't model rare splice variants like GC..AG or AT..AC. Still, the improvement in alignment accuracy and the simplicity of integration make it highly practical.
11. The authors emphasize that minisplice complements, rather than competes with, larger models like SpliceAI. Its strengths lie in efficiency, interpretability, and direct applicability to genome annotation and alignment tasks.
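To make the window and calibration steps from point 4 concrete, here is a rough Python sketch. It is not the minisplice code (which is written in C), and several details are assumptions for illustration: that the 202 bp window is 100 bases on each side of the 2-base motif, that raw scores lie in [0, 1], and that calibration is done by simple equal-width binning. The function names candidate_windows and empirical_probabilities are hypothetical.

```python
FLANK = 100  # assumption: 100 bases on each side of the 2-base motif -> 202 bp


def candidate_windows(seq, motif, flank=FLANK):
    """Yield (position, window) for each motif occurrence that has full flanks.

    `motif` would be "GT" for donor candidates or "AG" for acceptor candidates.
    """
    seq = seq.upper()
    start = seq.find(motif, flank)  # skip hits without a full left flank
    while start != -1 and start + len(motif) + flank <= len(seq):
        yield start, seq[start - flank:start + len(motif) + flank]
        start = seq.find(motif, start + 1)


def empirical_probabilities(scores, is_annotated, n_bins=10):
    """Turn raw scores into per-bin empirical probabilities.

    For each equal-width score bin, return the fraction of candidates in that
    bin that are annotated splice sites (None where the bin is empty).
    Assumes raw scores are in [0, 1]; this binning scheme is an illustration,
    not necessarily the procedure used in the paper.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, label in zip(scores, is_annotated):
        b = min(int(score * n_bins), n_bins - 1)  # score == 1.0 goes to last bin
        totals[b] += 1
        hits[b] += int(label)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy usage: one donor candidate in a synthetic sequence, then calibration
# of made-up scores against made-up annotation labels.
toy_seq = "A" * 100 + "GT" + "C" * 100
windows = list(candidate_windows(toy_seq, "GT"))
probs = empirical_probabilities(
    [0.05, 0.08, 0.95, 0.91, 0.97],
    [False, False, True, True, False],
)
```

A lookup table like probs is what lets a raw network output be read as "how often a site with this score is a real splice site", which is the kind of quantity an aligner can fold into its scoring.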
@lh3lh3
💻 Code: github.com/lh3/minisplice
📜 Paper: arxiv.org/abs/2506.12986v1
#bioinformatics #genomics #deeplearning #RNAseq #proteinalignment #splicing #computationalbiology