A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization
1. This is the first focused survey on using large language models (LLMs) for molecule generation and optimization. It introduces a taxonomy based on learning paradigms, covering both tuning-free (e.g., zero-shot, in-context learning) and tuning-based (e.g., supervised fine-tuning, preference tuning) methods.
2. The survey highlights how LLMs are uniquely positioned for molecular discovery because of their emergent capabilities (in-context learning, reasoning, and instruction following), which allow them to generalize across diverse chemical tasks without task-specific retraining.
3. In molecule generation, LLMs are deployed via prompting strategies (e.g., LLM4GraphGen, MolReGPT) or adapted through supervised datasets (e.g., Mol-Instructions, LlaSMol, ChatMol). Preference-tuned models like SmileyLlama and Mol-MoE show improved fidelity to molecular constraints.
4. For molecule optimization, the survey examines how LLMs refine existing molecules through goal-directed editing. Strategies include zero-shot optimization (LLM-MDE), retrieval-augmented prompting (ChatDrug), and evolution-based in-context learning (MOLLM, LLM-EO).
5. The survey identifies a trend toward hybrid frameworks combining fine-tuned worker models with external reasoning agents (e.g., MultiMol, DrugAssist), often leveraging GPT-4o or domain-specific scoring functions to enhance candidate selection and validation.
6. Multi-modal modeling is a growing focus, with models like UniMoT and Molx-Enhanced LLM incorporating graph or 3D inputs into LLMs via specialized tokenizers and embedding schemes, enabling structure-aware generation and optimization.
7. Benchmarking frameworks are categorized into structure-based (validity, uniqueness, diversity) and property-based (LogP, QED, synthetic accessibility, Pareto-optimality) metrics. The paper also provides a detailed summary of standard datasets for pretraining and evaluation.
8. The survey emphasizes the limitations of current LLMs: hallucinations, lack of transparency, and domain-incoherent outputs. Future work should prioritize trustworthy generation, interpretability, and error-aware prompting to enhance reliability.
9. Emerging directions include LLM-driven agent frameworks that integrate external tools (e.g., retrosynthesis engines, docking software) for iterative design, as well as cross-modal models that jointly encode chemical topology, text, and spatial information.
10. A continuously updated repository of LLM-centric molecular research is maintained on GitHub, making this survey a central resource for the field.
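
To make the metrics in point 7 concrete, here is a minimal Python sketch. It is illustrative only and not code from the survey. It computes a uniqueness ratio over a batch of generated SMILES strings and picks the Pareto-optimal candidates when every objective is maximized. The function names, the sample molecules, and the score values are all assumptions made up for this example. In practice the strings would first be canonicalized with a cheminformatics toolkit, and the scores would come from real property predictors.

```python
from typing import Dict, List, Tuple


def uniqueness(smiles: List[str]) -> float:
    """Fraction of distinct strings in a batch (assumes already-canonical SMILES)."""
    if not smiles:
        return 0.0
    return len(set(smiles)) / len(smiles)


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the candidates that no other candidate dominates (all objectives maximized)."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for other_name, other in scores.items() if other_name != name)
    ]


# Hypothetical batch of generated molecules (one duplicate on purpose).
generated = ["CCO", "CCO", "c1ccccc1", "CC(=O)O"]

# Made-up scores: (drug-likeness, negated synthesis difficulty), so higher is better for both.
candidate_scores = {
    "CCO": (0.40, -1.0),
    "c1ccccc1": (0.44, -1.0),
    "CC(=O)O": (0.50, -1.5),
}

print(uniqueness(generated))           # 0.75
print(pareto_front(candidate_scores))  # ['c1ccccc1', 'CC(=O)O']
```

Negating a "lower is better" score such as synthetic accessibility is one simple way to keep every objective in the same direction. A real evaluation would also report validity and diversity, which need a chemistry toolkit and are left out here.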
💻Code:
github.com/REAL-Lab-NU/Aweso…
📜Paper:
arxiv.org/abs/2505.16094
#LLM #MoleculeGeneration #MolecularOptimization #DrugDiscovery #ChemLLM #AI4Science #InContextLearning #SMILES #MolecularDesign #LargeLanguageModels