DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery
1. DrugPlayGround introduces a unified benchmark to objectively evaluate how well LLMs help drug discovery across four stages: drug function/property description, drug similarity via embeddings, drug synergy prediction, drug–protein interaction (DPI) prediction, and chemical perturbation (transcriptomic response) prediction. It explicitly aims to expose both strengths and failure modes (e.g., hallucinated chemistry).
2. A key design choice is the paired “text generation → embedding” pipeline: LLMs first generate drug descriptions under controlled prompts/temperatures; embedding models then encode those descriptions for downstream ML tasks. The benchmark emphasizes leakage prevention, quantitative metrics, and expert (chemist/biologist) review to test chemical/biological reasoning rather than only surface-level performance.
3. For drug description faithfulness (862 drugs sampled from MolTextNet), the study benchmarks Claude, DeepSeek, GPT-4o, Gemini-1.5 Pro, and Mistral-large across 90 model–prompt–temperature settings (3 generations per drug). Quality is scored by BLEU, ROUGE-1/2/L, and BERTScore, plus a combined “Normalized Total” score.
4. Main text-generation findings: (i) lower temperature usually improves reference alignment, but the optimal temperature is model-dependent; (ii) prompt choice matters more than temperature for both performance and stability; (iii) “Meta” (domain-expert framing) prompts consistently improve description quality vs standard prompts, while CoT prompts reduce lexical/structural alignment and increase truncation/hallucination artifacts.
5. Reliability caveats are made concrete: even strong configurations can output incorrect numeric facts (e.g., molecular weight), wrong formulas/functional groups/stereochemistry, or overgeneralized pharmacology. CoT prompting is particularly associated with “reasoning text” that degrades factual alignment, and some models produce structured-looking chemistry that is not necessarily correct.
6. Embedding evaluation separates “representation fidelity” from “task utility.” Using GPT-4o (Meta, T=0.0) to generate descriptions, the authors compare embedding models (text-embedding-3-large, Gemini embedding, mistral-embed, Gemma-300m, Qwen3-Embedding-8B) by cosine similarity to ground-truth MolTextNet embeddings. Most achieve high similarity (>0.7) except Qwen3-Emb; Mistral-Emb is strongest, suggesting embedding quality is not simply tied to parameter scale.
7. In drug synergy prediction (BAITSAO framework; multiple synergy datasets with cell-line context), LLM-derived embeddings outperform a structure-focused molecular foundation model baseline (UniMol) and also outperform direct “LLM inference” (GPT-5.1 as a QA classifier). Gemini-Emb and Mistral-Emb are top overall across classification and regression metrics (AUROC/ACC; PCC/R²).
8. The benchmark adds mechanistic error analysis with domain experts: the same drug pair (5-FU + dasatinib) can be predictable in one context (VCaP; AR-driven, more homogeneous) but unpredictable in another (MSTO-211H; heterogeneous, redundant signaling). The takeaway is that synergy predictability depends strongly on clarity of drug mechanism descriptions and how well cell state/driver biology is defined; adding efficacy-linked details (e.g., EC values) to descriptions may improve downstream performance.
9. For DPI prediction, drug embeddings from LLMs are paired with fixed protein embeddings (ESMC) and evaluated on TDC datasets (Human, DrugBank, C. elegans). LLM drug embeddings generally outperform domain-specific/structure-only embeddings, but “best embedding” is dataset-dependent: GPT embeddings are advantageous for Human, while Gemini/Mistral often lead on DrugBank; Gemini/Qwen3 perform better on C. elegans. Higher description-generation temperatures can help DPI, plausibly by increasing functional detail diversity in text.
10. For chemical perturbation prediction (Tahoe 100M; ~1,100 compounds across 50 cancer cell lines; ChemCPA model), swapping RDKit-style baselines for LLM-derived drug embeddings consistently improves R². Best average performance is reported with Qwen3-Emb using GPT-4o descriptions at T=0.4 (high R² with low variance), while some configurations increase variance, which highlights a performance/robustness trade-off. Qualitative examples suggest biologically grounded annotations (e.g., “tetracycline antibiotic”) support better perturbation prediction than descriptions dominated by physicochemical properties.
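For anyone who wants to try the two simplest quantitative checks mentioned in this thread (embedding fidelity via cosine similarity in point 6, and PCC/R² in points 7 and 10), here is a minimal Python sketch. It is not the authors' code: the function names and the toy arrays are made up for illustration, and it assumes embeddings are plain NumPy arrays with one row per drug.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


def mean_fidelity(generated, reference):
    """Mean row-wise cosine similarity between two (n_drugs, dim) matrices.

    Hypothetical helper: row i of each matrix is assumed to describe the same drug.
    """
    g = np.asarray(generated, dtype=float)
    r = np.asarray(reference, dtype=float)
    if g.shape != r.shape:
        raise ValueError("embedding matrices must have the same shape")
    sims = [cosine_similarity(g_row, r_row) for g_row, r_row in zip(g, r)]
    return float(np.mean(sims))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return float(1.0 - ss_res / ss_tot)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient (PCC) between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


if __name__ == "__main__":
    # Toy data only: random vectors stand in for real description embeddings.
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(5, 16))
    generated = reference + 0.1 * rng.normal(size=(5, 16))
    print("mean fidelity:", mean_fidelity(generated, reference))

    y_true = rng.normal(size=20)
    y_pred = y_true + 0.2 * rng.normal(size=20)
    print("R^2:", r2_score(y_true, y_pred), "PCC:", pearson(y_true, y_pred))
```

Repeating the R² computation over several random seeds or data splits and reporting both the mean and the variance is one simple way to look at the performance/robustness trade-off described in point 10.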
💻Code:
github.com/HelloWorldLTY/dru…
📜Paper:
biorxiv.org/content/10.64898…
#DrugDiscovery #LLM #Benchmark #Embeddings #Chemoinformatics #Bioinformatics #DrugSynergy #DTI #SingleCell #PerturbationSeq