Reverse-Complement Consistency for DNA Language Models
1. This paper introduces Reverse-Complement Consistency Regularization (RCCR), a novel method to enhance the reliability of DNA language models by ensuring predictions are consistent between a DNA sequence and its reverse complement. This addresses a critical issue where existing models often fail to capture the inherent symmetry of DNA sequences, leading to inconsistent and unreliable predictions.
2. RCCR is a model-agnostic fine-tuning objective that directly penalizes the divergence between a model’s prediction on a sequence and the aligned prediction on its reverse complement. It applies to sequence classification, scalar regression, and profile prediction.
3. The method is evaluated on three backbone models (Nucleotide Transformer, HyenaDNA, and DNABERT-2) and shows substantial improvements in robustness. RCCR reduces prediction flips and errors while maintaining or improving task accuracy compared with RC data augmentation and test-time averaging.
4. RCCR incorporates a key biological prior directly into the learning process, making it an intrinsically robust and computationally efficient solution. It produces a single, robust model without doubling inference cost, unlike test-time averaging.
5. The paper provides theoretical guarantees: symmetrization does not increase risk under the RCCR objective, and global minimizers are RC-consistent when the labels are RC-symmetric. Enforcing agreement during training therefore does not sacrifice task performance.
6. The paper also introduces a compact RC robustness suite (SFR, RC-Corr) to standardize the reporting of orientation robustness alongside task metrics. This allows for more comprehensive and comparable evaluations across different models and tasks.
7. The experiments include a negative control on strand-specific prediction, demonstrating that RCCR is not suitable for tasks that require explicit RC variation. This highlights the importance of applying RCCR appropriately based on the biological context of the task.
8. The authors conclude that RCCR is a powerful tool for improving the reliability and interpretability of DNA language models by directly encoding a fundamental biological prior. Future work could explore extending this approach to other biological symmetries and generative models.
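To make the idea in points 2 and 6 concrete, here is a minimal NumPy sketch of a consistency-regularized loss and two orientation-robustness checks. It is an illustration under stated assumptions, not the authors' implementation: the symmetric KL divergence, the weight `lam`, and the helper names `rccr_loss`, `flip_rate`, and `rc_corr` are all hypothetical, and the paper's exact definitions of SFR and RC-Corr may differ from the simple disagreement rate and Pearson correlation used here.

```python
import numpy as np

# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetric_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Symmetric KL divergence per example (assumed divergence, see lead-in)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return 0.5 * (kl_pq + kl_qp)


def rccr_loss(task_loss: float, probs_fwd: np.ndarray,
              probs_rc: np.ndarray, lam: float = 1.0) -> float:
    """Task loss plus a penalty on forward vs. reverse-complement disagreement.

    probs_fwd: class probabilities for each sequence x, shape (batch, classes).
    probs_rc:  class probabilities for reverse_complement(x), same shape.
    For profile tasks the RC output would first need to be reversed so that
    positions line up; that alignment step is omitted in this sketch.
    """
    penalty = float(np.mean(symmetric_kl(probs_fwd, probs_rc)))
    return task_loss + lam * penalty


def flip_rate(probs_fwd: np.ndarray, probs_rc: np.ndarray) -> float:
    """Fraction of examples whose predicted class changes under RC."""
    return float(np.mean(np.argmax(probs_fwd, axis=-1) != np.argmax(probs_rc, axis=-1)))


def rc_corr(y_fwd: np.ndarray, y_rc: np.ndarray) -> float:
    """Pearson correlation between scalar predictions on x and on RC(x)."""
    return float(np.corrcoef(y_fwd, y_rc)[0, 1])
```

In a real fine-tuning loop the same penalty would be computed on framework tensors so that gradients flow through both the forward and the reverse-complement pass; the resulting single model is then evaluated once per sequence, which is where the inference saving over test-time averaging comes from.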
📜Paper:
arxiv.org/abs/2509.18529
#DNALanguageModels #ReverseComplement #Genomics #ModelRobustness #Bioinformatics