BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences
1. BINSEQ and VBINSEQ are two new binary formats designed to drastically improve the performance of DNA sequence data processing. Compared to gzip-compressed FASTQ, these formats can deliver up to 32x faster processing with equal or better storage efficiency.
2. The key innovation of BINSEQ is its use of fixed-size records and a dense two-bit encoding scheme, allowing true random access and efficient parallel parsing. This means no decompression bottlenecks and seamless multithreaded processing.
3. VBINSEQ extends these ideas to variable-length sequences while preserving high parallel performance. It supports optional quality scores, ZSTD compression, and block-wise organization for efficient indexing and access.
4. In benchmark tests, BINSEQ and VBINSEQ significantly outperformed traditional formats like FASTQ, BAM, and CRAM in sequence reading, k-mer counting, and genome alignment tasks. The difference becomes especially pronounced as thread counts increase.
5. For k-mer counting, the BINSEQ formats continued to scale linearly up to 128 threads, reaching 8–16x higher throughput than FASTQ, which plateaued at just 4–8 threads due to I/O bottlenecks.
6. In genome alignment tasks using minimap2 and STAR, VBINSEQ consistently outperformed FASTQ. Gains were especially notable with long-read data and high thread counts, reflecting faster data delivery to computation pipelines.
7. The authors developed Rust-based high-performance libraries with bindings for C and C++, as well as a command-line tool for format conversion. The formats were also integrated with minimap2 and STAR to demonstrate real-world utility.
8. While FASTQ’s flexibility made it a long-standing standard, BINSEQ and VBINSEQ offer a compelling alternative for modern genomics. They unlock full CPU potential, reduce I/O constraints, and simplify data processing in high-throughput environments.
9. Practitioners can choose BINSEQ for maximum efficiency with fixed-length reads (e.g., Illumina) or VBINSEQ for flexible support of long-read data (e.g., ONT, PacBio), including quality scores and optional compression.
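To make point 2 concrete, here is a minimal Python sketch of two-bit packing with fixed-size records. It is an illustration under stated assumptions, not the actual BINSEQ layout: the bit assignment (A=0, C=1, G=2, T=3), the byte order within a record, and the helper names `pack`, `unpack`, `record_size`, and `get_record` are all hypothetical, and the real format's header and per-record fields are omitted.

```python
# Hypothetical 2-bit mapping; the real BINSEQ bit assignment may differ.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a nucleotide string at 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(buf: bytes, n: int) -> str:
    """Decode the first n bases from a packed buffer."""
    return "".join(
        DECODE[(buf[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)
    )


def record_size(read_len: int) -> int:
    """Bytes per record when every read has the same length."""
    return (read_len + 3) // 4


def get_record(data: bytes, index: int, read_len: int) -> str:
    """Random access: the offset of record `index` is plain arithmetic."""
    size = record_size(read_len)
    start = index * size
    return unpack(data[start:start + size], read_len)


# Three fixed-length reads stored back to back, with no delimiters.
reads = ["ACGTACGT", "TTTTCCCC", "GATTACAG"]
blob = b"".join(pack(r) for r in reads)

third = get_record(blob, 2, read_len=8)
```

Because every record occupies the same number of bytes, a worker thread can compute the byte range for its share of records without scanning for line breaks or decompressing earlier data, which is the property the thread credits for parallel parsing.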
📜Paper:
biorxiv.org/content/10.1101/…
#bioinformatics #genomics #NGS #computationalbiology #sequencing #dataformats #parallelcomputing #RustLang