Biology AI Daily

16 Oct 2025

Same Model, Better Performance: The Impact of Shuffling on DNA Language Models Benchmarking

1. A new study by Greco and Rawlik from the University of Edinburgh highlights a critical issue in benchmarking DNA language models (DNA LMs): the impact of data shuffling on model performance. The authors demonstrate that seemingly minor implementation details, such as the number of data loading workers and buffer sizes, can create significant performance variations of up to 4% for identical models.
2. The study focuses on BEND (Benchmarking DNA Language Models), a popular benchmarking framework. The authors show that BEND's implementation inadvertently introduces dependencies on hardware-specific hyperparameters, leading to biased training dynamics and affecting both absolute performance and relative model rankings.
3. The core problem stems from inadequate data shuffling interacting with the unique characteristics of genomic data, such as spatial dependencies and sequence overlap. The authors propose a simple yet effective solution: pre-shuffling data before storage. This approach eliminates hardware dependencies while maintaining efficiency.
4. Experiments with three DNA language models (HyenaDNA, DNABERT-2, and ResNet-LM) confirm that pre-shuffling significantly improves performance across all models. For instance, pre-shuffling increases the CpG methylation task performance by 4% compared to the default BEND implementation.
5. The study emphasizes the importance of considering domain-specific data characteristics when designing benchmarks. It highlights how standard machine learning practices can interact unexpectedly with genomic data, leading to unintended biases. This work provides valuable insights for benchmark design in specialized domains.
6. The authors also discuss the broader implications of their findings, suggesting that pre-shuffling should be a standard practice in benchmarking frameworks to avoid implementation artifacts that compromise evaluation validity.
7. The code for this study is publicly available at github.com/baillielab/BEND, allowing researchers to replicate and build upon these findings.

📜Paper: arxiv.org/abs/2510.12617

#DNALanguageModels #Benchmarking #Genomics #MachineLearning #DataShuffling #ComputationalBiology
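To make point 3 concrete, here is a toy sketch of why a small shuffle buffer can leave ordered data poorly mixed, and why shuffling once before storage helps. This is not BEND's code or the authors' implementation: the function names (`buffer_shuffle`, `mean_batch_span`), the buffer size of 100, the batch size of 32, and the use of integers as stand-ins for genomic positions are all assumptions made for illustration.

```python
import random
from statistics import mean


def buffer_shuffle(stream, buffer_size, rng):
    """Yield items from `stream` in approximately shuffled order via a fixed-size buffer."""
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


def mean_batch_span(order, batch_size):
    """Average (max - min) position within each batch; a small span means poor mixing."""
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    return mean(max(b) - min(b) for b in batches)


rng = random.Random(0)
positions = list(range(10_000))  # hypothetical samples stored in genomic order

# Case 1: stored in order, shuffled only through a small buffer at load time.
buffered_only = list(buffer_shuffle(positions, buffer_size=100, rng=rng))

# Case 2: shuffled once before storage, then streamed through the same buffer.
pre_shuffled = positions[:]
rng.shuffle(pre_shuffled)
pre_then_buffered = list(buffer_shuffle(pre_shuffled, buffer_size=100, rng=rng))

span_buffered = mean_batch_span(buffered_only, batch_size=32)
span_pre = mean_batch_span(pre_then_buffered, batch_size=32)
print(f"mean batch span, buffer only:          {span_buffered:.0f}")
print(f"mean batch span, pre-shuffle + buffer: {span_pre:.0f}")
```

In this toy setup the buffer-only batches draw from a narrow window of neighbouring positions, while the pre-shuffled batches span most of the dataset regardless of buffer size. Real loaders add further effects (for example, multiple workers each reading their own shard), which this sketch does not model.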

Applied Sciences MDPI

@Applsci

29 Sep 2025

📢 #highlycited paper 📚 Analyzing #DataReference Characteristics of #DeepLearning Workloads for Improving #BufferCache Performance 🔗 mdpi.com/2076-3417/13/22/121… 👨‍🔬 by Jeongha Lee et al. 🏫 Ewha University #datashuffling #neuralnetwork #performance

GSPANN Technologies @gspanntech

20 Oct 2023

#GoogleBigQuery can simplify the migration process by redefining workflows that align with warehouse operations. Read our #whitepaper to learn key design aspects that would help in the successful implementation of #BigQuery. bit.ly/G_BigQuery #dataprivacy #datashuffling

GSPANN Technologies @gspanntech

1 Aug 2023

GSPANN Technologies @gspanntech

21 Dec 2021

Download our #whitepaper to learn the key design aspects that help in the successful implementation of #BigQuery: bit.ly/G_BigQuery #googlebigquery #bigdata #datascience #googleanalytics #datamodeling #datawarehouse #datastrategy #dataprivacy #datashuffling

GSPANN Technologies @gspanntech

24 Feb 2021

#GoogleBigQuery can help you with #datashuffling. Learn about the best approach for #datamodeling in #BigQuery. bit.ly/3lpA9jb #BigData #DataScience #DataStrategy #DataPrivacy #DataStorage #DataProcessing #GoogleAnalytics #MachineLearning #GoogleCloud #DataFusion

#Google BigQuery, a fully managed service, can help you shuffle data efficiently. Download our white paper to learn about the best approach for data modeling in #BigQuery.

#WhitePaper #BigData #DataScience #DataModeling #DataStrategy #DataPrivacy #DataShuffling #DataStructures #GoogleBigQuery #DataStorage #DataTypes #DataAndAnalytics #DataProcessing #GoogleAnalytics #MachineLearning #GoogleCloud #Python #GCPCloud #DataFusion

GSPANN Technologies @gspanntech

8 Oct 2020

#Google #BigQuery can help you with efficient #datashuffling. Download our #whitepaper now: bit.ly/3lpA9jb #BigData #DataScience #DataModeling #DataStrategy #DataPrivacy #DataStructures #GoogleBigQuery #DataStorage #DataTypes #DataAndAnalytics #DataProcessing

Image description: #Google #BigQuery, a fully managed service, can help you shuffle data efficiently. Download our white paper to learn about the best approach for data modeling in #BigQuery. #WhitePaper #BigData #DataScience #DataModeling #DataStrategy #DataPrivacy #DataShuffling #DataStructures #GoogleBigQuery #DataStorage #DataTypes #DataAndAnalytics #DataProcessing