MolDeTox: Evaluating Language Model’s Stepwise Fragment Editing for Molecular Detoxification
1 MolDeTox introduces a stepwise benchmark that tests whether LLMs/VLMs can detoxify molecules via minimal, localized edits: identify toxicity-relevant fragments, propose non-toxic replacements, then generate a full non-toxic analog while preserving physicochemical properties.
2 The benchmark is built on a new dataset, ToxicityCliff: ~52,885 toxic/non-toxic molecule pairs across 49 toxicity endpoints, curated so each pair is globally similar but differs locally (toxicity label flips with small structural changes), reflecting realistic “design-around-toxicity” scenarios.
3 A key design choice is using SAFE, a fragment-level molecular representation (BRICS-based) that makes substructures explicit. MolDeTox uses SAFE both for interpreting molecules (fragment identification) and for generation (generate SAFE then decode to SMILES), aiming to reduce invalid structures and improve edit locality.
4 MolDeTox decomposes detoxification into three QA tasks, each with single-step and multi-step variants: Task 1, toxic fragment identification (output only the toxic fragment, Ft); Task 2, non-toxic fragment generation (output only the replacement fragment, Fnt); Task 3, full non-toxic molecule generation (output the whole molecule, Mnt). This makes failures diagnosable rather than only scoring end-to-end success.
5 Data construction emphasizes “minimal-edit property preservation”: candidate pairs are selected with high similarity thresholds (scaffold/ECFP4/Levenshtein), then filtered with IQR-based rules to remove outliers in fragment counts/lengths and in key RDKit properties (MW, logP, TPSA, HBD, HBA, RotB).
6 Compared with prior detoxification benchmarks that often rely on proxy toxicity predictors, MolDeTox evaluates by matching to real non-toxic counterparts from curated pairs, avoiding single-model toxicity scoring but making the task harder and more reflective of exact-edit requirements.
7 Results show a consistent difficulty ladder: performance drops from Task 1 to Task 2 to Task 3. Even strong models struggle on full-molecule exact match (Task 3), especially in multi-step edits, highlighting that fragment-level reasoning does not reliably translate into correct end-to-end reconstruction.
8 In-context learning is the most reliable boost. For example, 4-shot prompting lifts GPT-5.2's Task 1 single-step accuracy to ~54% and substantially increases its Task 3 single-step exact match under SAFE generation (to ~15.6%), indicating that models benefit from retrieved, structurally similar examples.
9 SAFE-based generation improves chemical validity and property retention versus direct SMILES generation across models: reported gains include higher fingerprint similarity, higher validity, and higher PRS (Property Retention Score), supporting the idea that fragment-first generation better preserves the unchanged parts of a molecule.
10 The paper adds a step-dependency analysis (8 outcome cases across T1/T2/T3 correctness), showing that most Task 3 failures are “complete breakdown” (T1=0,T2=0,T3=0), while successes usually coincide with correct intermediate steps. This is evidence that the decomposition is informative for diagnosing where detoxification pipelines fail.
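The IQR-based filtering in point 5 can be illustrated with a minimal sketch. It assumes the conventional Tukey fence (values outside Q1 − 1.5·IQR and Q3 + 1.5·IQR count as outliers) and hypothetical per-pair fields such as `delta_mw` and `delta_logp`; the post does not state the paper's exact rule, multiplier, or field names, and the numbers below are toy values, not data from ToxicityCliff.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) Tukey fences for a list of numbers.

    k=1.5 is the conventional multiplier, assumed here for illustration.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_pairs_by_iqr(pairs, keys, k=1.5):
    """Keep only pairs whose value lies inside the fences for every key."""
    bounds = {key: iqr_bounds([p[key] for p in pairs], k) for key in keys}
    return [
        p for p in pairs
        if all(bounds[key][0] <= p[key] <= bounds[key][1] for key in keys)
    ]


# Toy molecule pairs: hypothetical property differences between the toxic
# and non-toxic member of each pair (not real data).
toy_mw = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]
toy_logp = [0.1, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1, 0.3, 0.2, 0.2]
toy_pairs = [
    {"pair_id": i, "delta_mw": mw, "delta_logp": lp}
    for i, (mw, lp) in enumerate(zip(toy_mw, toy_logp))
]

kept_pairs = filter_pairs_by_iqr(toy_pairs, keys=["delta_mw", "delta_logp"])
```

In this toy set, the pair with `delta_mw` of 100 falls outside the upper fence and is dropped, while the other nine pairs remain. In a real pipeline the property values would come from a cheminformatics toolkit such as RDKit rather than being typed in by hand.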
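The step-dependency analysis in point 10 amounts to bucketing each example by its (T1, T2, T3) correctness triple. The sketch below is one way to tally the 8 cases; the function names and the toy records are illustrative assumptions, not the paper's code or results.

```python
from collections import Counter
from itertools import product


def tally_outcomes(records):
    """Count examples in each of the 8 (T1, T2, T3) correctness cases.

    Each record is a triple of truthy/falsy flags, one per task.
    Cases with no examples are kept with a count of 0.
    """
    counts = Counter({case: 0 for case in product((0, 1), repeat=3)})
    for t1, t2, t3 in records:
        counts[(int(bool(t1)), int(bool(t2)), int(bool(t3)))] += 1
    return counts


def breakdown_share(counts):
    """Fraction of Task 3 failures where Tasks 1 and 2 also failed."""
    failures = sum(n for (_, _, t3), n in counts.items() if t3 == 0)
    return counts[(0, 0, 0)] / failures if failures else 0.0


# Toy per-example results (made up for illustration only).
toy_records = [
    (0, 0, 0),
    (0, 0, 0),
    (1, 0, 0),
    (1, 1, 1),
    (0, 0, 0),
    (1, 1, 0),
]

outcome_counts = tally_outcomes(toy_records)
share = breakdown_share(outcome_counts)
```

A high `breakdown_share` would correspond to the pattern the post describes, where most Task 3 failures coincide with failures at both earlier steps.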
💻Code:
github.com/datamol-io/safe
📜Paper:
arxiv.org/abs/2605.12181
#ComputationalBiology #Cheminformatics #LLM #MolecularOptimization #DrugDiscovery #Toxicology #Benchmark #MultimodalAI #GenerativeAI #FragmentBasedDesign