MolViBench: Evaluating LLMs on Molecular Vibe Coding
1 MolViBench is introduced as the first benchmark explicitly targeting “Molecular Vibe Coding”: chemists describe molecular tasks in natural language, and LLMs must generate executable RDKit-centric Python programs that are also chemically correct.
2 The benchmark contains 358 curated, real-world molecular coding tasks spanning 12 drug-discovery workflow categories (e.g., ADMET screening, virtual screening, reaction simulation, clustering/selection, QSAR/ML, stereochemistry, rule-based filtering, cheminformatics I/O), aiming to test the intersection of programming molecular understanding domain reasoning.
3 A key design choice is cognition-driven difficulty, organized into five Bloom-style levels: L1 single-API recall (descriptors, conversions), L2 two-step tool composition (fingerprints similarity, clustering, conformers), L3 multi-step chemical reasoning (reactions, ADMET heuristics, alerts like PAINS/Brenk), L4 branching/iteration/error recovery patterns, and L5 end-to-end pipeline creation (virtual screening pipelines, GA optimization, scaffold-split QSAR, Pareto/MPO ranking).
4 To make evaluation deterministic and comparable, tasks are constrained to RDKit as the primary toolkit (with limited auxiliary libs: numpy/pandas/scikit-learn/matplotlib/selfies). Each task has a human-authored gold reference solution under a unified interface (a single function named level_function), plus two-round quality verification for executability and chemical correctness.
5 The evaluation framework goes beyond plain unit-test exact match, because molecular tasks often have heterogeneous outputs and multiple valid solution paths. It uses a 3-stage scheme: (i) executability checks (compile entry-point runtime on multiple inputs with timeout), (ii) type-aware output comparison (tolerances for floats, canonical SMILES equivalence, recursive checks for structured outputs), and (iii) an AST-based API-semantic fallback that measures whether the generated code uses the appropriate RDKit functional categories when outputs are incomparable or when alternative valid implementations exist.
6 The paper reports systematic results on 9 frontier coding LLMs across three inference paradigms: Direct Generation (single shot), Incremental Repair (execute feed traceback for up to 3 repair rounds), and Agent Collaboration (Coder–Tester loop where a tester proposes test plans and validates plausibility).
7 Best overall performance is achieved by Claude Opus 4.6 Thinking with Incremental Repair (Pass@1 = 39.7%, Executable rate = 98.9%, Fallback Pass Rate = 72.6%). Across models, executability can be high while chemical/output correctness remains a bottleneck, motivating the multi-layer evaluation.
8 Performance drops monotonically with higher cognitive levels for all models; notably, every model remains below 10% Pass@1 on Level 5 tasks, highlighting a persistent gap in synthesizing long-horizon, end-to-end molecular discovery pipelines (planning, orchestration, and robust glue code).
9 Error analyses illustrate domain-specific failure modes: subtle RDKit API confusions that change return types (e.g., QED.qed vs QED.default), “over-engineering” outputs (returning dict metadata instead of requested scalar/SMILES), and chemically wrong reaction SMARTS due to small mapping/aromaticity constraints that yield fundamentally different products.
10 Incremental Repair is found to be consistently robust across models and levels, while Agent Collaboration is more sensitive: it can help weaker models but may degrade stronger models by introducing unnecessary rewrites that break partially correct solutions.
💻Code:
github.com/phenixace/MolViBe…
📜Paper:
arxiv.org/abs/2605.02351
#LLM #CodeGeneration #Cheminformatics #RDKit #DrugDiscovery #Benchmark #MolecularAI #VirtualScreening #QSAR #ADMET