DEL Simulator: A Digital Twin for Understanding Machine Learning on DNA-Encoded Libraries
1. The article introduces a digital twin, an in-silico DEL simulator, that models the underlying chemistry and selection processes of DNA-encoded libraries (DELs) as a function of key design parameters: read count, number of selection cycles, one-step reaction yield, and library size. The simulator offers a statistically principled, interpretable model of how DEL data are generated, which makes DEL experiments easier to understand and analyze.
2. Using the simulator, the authors systematically investigate how these design parameters influence downstream machine learning (ML) virtual screening. They identify specific regimes where preprocessing steps such as disynthon aggregation significantly improve screening performance. Notably, they show that increasing library size can degrade ML-based screening performance, challenging the common assumption that larger libraries always lead to better outcomes.
3. The DEL Simulator comprises seven modular components: Library Generation, Affinity Modeling, Affinity Selection and Readout, Data Processing and Aggregation, Model Training, Virtual Screening, and Model Evaluation. This modular design makes the simulator flexible and extensible, so researchers can explore different experimental setups and ML strategies.
4. The simulator generates synthetic ground-truth affinity data, allowing researchers to run a variety of in-silico DEL hit identification campaigns with varied experimental parameters. Because the true affinities are known, different data analysis techniques can be compared directly against ground truth, which helps in optimizing DEL campaigns.
5. The study uses the DEL Simulator to construct two types of 3-cycle DELs (LIB-A and LIB-B) and runs in-silico prospective DEL-ML hit identification campaigns against two targets: MK14 and sEH. The results highlight how experimental parameters affect ML model performance, demonstrating the simulator's utility in guiding experimental design.
6. The DEL Simulator enables rapid exploration of how experimental parameters affect data quality. For instance, increasing the number of selection cycles or the read count improves the correlation between observed counts and true affinities, while decreasing reaction yield degrades this correlation because truncated synthesis products (truncates) add noise.
7. The article concludes that the DEL Simulator serves as a realistic digital twin of experimental DEL screens, producing chemically grounded, high-fidelity data that can be used to benchmark analysis pipelines, design better selection strategies, or train ML models. Future work may use this platform to systematically investigate ways to enhance library diversity and optimize ML strategies in DEL screening.
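
To make point 6 concrete, here is a minimal toy sketch in Python. It is not the paper's simulator. Every name and number in it is an assumption made for illustration: the uniform "retention probability" standing in for affinity, the multinomial read sampling, and the treatment of truncates as products that share a DNA tag but bind with an unrelated probability. It only illustrates the read-count and reaction-yield trends, and does not attempt to reproduce the selection-cycle effect.

```python
import random


def average_ranks(values):
    """Rank values (1-based), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def simulate(n_compounds=500, selection_rounds=2, reads=10_000,
             step_yield=1.0, synthesis_steps=3, seed=0):
    """Return Spearman(true retention, observed counts) for a toy DEL screen."""
    rng = random.Random(seed)
    # Assumed per-round retention probability of each full-length compound.
    p_true = [rng.uniform(0.01, 0.5) for _ in range(n_compounds)]
    # Assumption: truncates carry the same tag but bind with unrelated strength.
    p_trunc = [rng.uniform(0.01, 0.5) for _ in range(n_compounds)]
    # Fraction of each tag's material that is full-length product.
    full = step_yield ** synthesis_steps
    # Relative tag abundance after selection: each species is enriched separately.
    weights = [full * p ** selection_rounds + (1 - full) * q ** selection_rounds
               for p, q in zip(p_true, p_trunc)]
    # Sequencing readout: draw `reads` tags in proportion to abundance.
    counts = [0] * n_compounds
    for idx in rng.choices(range(n_compounds), weights=weights, k=reads):
        counts[idx] += 1
    return spearman(p_true, counts)


if __name__ == "__main__":
    for reads in (1_000, 100_000):
        print("reads", reads, "rho", round(simulate(reads=reads), 3))
    for y in (1.0, 0.5):
        print("yield", y, "rho", round(simulate(reads=100_000, step_yield=y), 3))
```

In this toy setup, sequencing more deeply raises the rank correlation between counts and the assumed ground truth, and lowering the one-step yield reduces it because most of each tag's signal then comes from truncates.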
📜Paper:
doi.org/10.26434/chemrxiv-20…
#DELsimulator #MachineLearning #DNALibraries #DrugDiscovery #ComputationalBiology