SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?
1. The paper introduces SMDD-Bench, an agentic, multi-turn, long-horizon benchmark for evaluating LLM agents on realistic small-molecule drug design workflows, moving beyond single-turn chemistry QA and toy tasks.
2. SMDD-Bench contains 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification (25), Interaction Point Discovery (25), Scaffold Hopping (52), Lead Optimization (340), and Fragment Assembly (60), covering 102 unique protein targets and broad chemical space.
3. A key methodological contribution is “witness-aware task generation”: for task types where solvability is not guaranteed by default (Scaffold Hopping, Lead Optimization, Fragment Assembly), each instance is constructed together with a hidden witness molecule that is known to satisfy the evaluation criteria, ensuring every task is solvable by construction without human curation.
4. The benchmark is designed to test capabilities needed for practical computational medicinal chemistry: chemical/biological reasoning, 3D geometric intuition, planning under limited expensive oracle calls, and tool use (Python/RDKit workflows, structure/interaction analysis), rather than only knowledge recall.
5. Evaluation is fully automated (no human grading) and uses a tool stack typical of computational drug design pipelines: RDKit for chemistry and filters, PLIP for interaction fingerprints, OpenBabel utilities, Boltz2 for protein–ligand co-folding plus binding probability/affinity, and ADMET-AI for multi-property prediction.
6. Each task type targets a distinct real-world skill: (a) 2D Pharmacophore ID requires writing a Python function that generalizes from 10 actives and 10 inactives to hidden ChEMBL actives/inactives; (b) Interaction Point Discovery requires predicting 3 conserved pocket “hotspots” (3D coordinates plus type) derived from large co-crystal ensembles; (c) Scaffold Hopping requires low 2D similarity but matching 3D interaction patterns; (d) Lead Optimization requires multi-objective improvement while holding other properties constant under hard constraints; (e) Fragment Assembly requires linking 1–2 posed fragments while preserving pose and binding.
7. Benchmarking 7 frontier LLMs with a minimalist ReAct-style agent (no internet, obfuscated IDs, limited oracle budgets: 8 Boltz2 calls and 15 ADMET-AI calls) shows substantial headroom: the best model (GPT-5.4) solves 40.2% overall, with most wins coming from Lead Optimization rather than 3D-heavy tasks.
8. Results highlight a consistent weakness in 3D reasoning: Interaction Point Discovery is near-zero for most models, and Scaffold Hopping / Fragment Assembly success rates are also very low, suggesting that “tool access” alone does not yield reliable 3D pocket/pose reasoning.
9. The authors analyze novelty and diversity of generated molecules. Many submitted molecules are “novel” relative to major databases (ChEMBL, SureChEMBL, PubChem, BindingDB), but repeated runs reveal limited diversity: agents often converge to similar solutions (high pairwise Tanimoto among successful outputs), which is misaligned with real lead-optimization needs where multiple diverse viable candidates are preferred.
10. An “enumeration vs. selection” study suggests agents can mention many candidate SMILES that would pass evaluation but fail to choose them for oracle calls/submission—especially in Scaffold Hopping—pointing to planning/selection as a bottleneck, not only molecule generation.
11. The paper also provides a practical adoption path via SMDD-Bench Lite (100 instances) to reduce compute barriers while keeping difficulty representative, plus SMDD-Bench Diversity (20 hard lead-optimization instances, run 10x each) to quantify output diversity and novelty under repeated trials.
12. Common failure modes observed in traces include poor cross-turn SAR synthesis (not learning exclusion rules from failed candidates), incoherent multi-turn planning (re-testing or re-proposing failed molecules), and tool-specific coding errors (e.g., malformed calls, RDKit conversion issues), emphasizing that robust agent scaffolding matters as much as base-model capability.
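To make the limited oracle budgets in item 7 concrete (8 Boltz2 calls, 15 ADMET-AI calls), here is a minimal sketch of a call counter wrapped around each tool. The names BudgetedOracle and BudgetExhausted and the stand-in scoring lambdas are hypothetical; how the paper's harness actually enforces budgets is not covered here and may differ.

```python
from typing import Callable, Dict


class BudgetExhausted(RuntimeError):
    """Raised when an agent tries to exceed its oracle call budget."""


class BudgetedOracle:
    """Hypothetical wrapper: counts calls to a scorer and enforces a hard cap."""

    def __init__(self, name: str, fn: Callable[[str], float], max_calls: int) -> None:
        self.name = name
        self._fn = fn
        self.max_calls = max_calls
        self.calls_used = 0

    @property
    def remaining(self) -> int:
        # Calls the agent may still spend on this oracle.
        return self.max_calls - self.calls_used

    def __call__(self, smiles: str) -> float:
        if self.calls_used >= self.max_calls:
            raise BudgetExhausted(
                f"{self.name}: budget of {self.max_calls} calls used up"
            )
        self.calls_used += 1
        return self._fn(smiles)


# Stand-in scorers (assumptions); real Boltz2 / ADMET-AI calls would go here.
oracles: Dict[str, BudgetedOracle] = {
    "boltz2": BudgetedOracle("boltz2", lambda smi: float(len(smi)), max_calls=8),
    "admet_ai": BudgetedOracle(
        "admet_ai", lambda smi: float(smi.count("N")), max_calls=15
    ),
}
```

An agent loop could check oracles["boltz2"].remaining before deciding whether a candidate is worth an expensive call, which is the kind of selection decision item 10 identifies as a bottleneck.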
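The diversity finding in item 9 rests on pairwise Tanimoto similarity, T(A, B) = |A ∩ B| / |A ∪ B| over fingerprint bit sets. Below is a minimal sketch, assuming fingerprints are already available as sets of on-bit indices (in practice they might come from a cheminformatics toolkit such as RDKit). The function names and toy fingerprints are illustrative and are not taken from the paper.

```python
from itertools import combinations
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints (assumption)
    return len(a & b) / union


def mean_pairwise_tanimoto(fps: List[Set[int]]) -> float:
    """Average similarity over all unordered pairs; higher means less diverse."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints standing in for successful outputs from repeated runs.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}]
mean_similarity = mean_pairwise_tanimoto(toy_fps)
```

A high mean over repeated runs of the same task would indicate that the agent keeps converging on near-identical solutions, which is the pattern item 9 describes.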
📜Paper:
arxiv.org/abs/2605.21740
#LLM #DrugDiscovery #MedicinalChemistry #Cheminformatics #Benchmark #Agents #ADMET #StructureBasedDesign #ComputationalBiology #RDKit