Leveraging Large Language Models for Literature-Driven Prioritization of Protein Binding Pockets
1.This study presents a hybrid pipeline that integrates geometric pocket prediction (via Fpocket) with Large Language Models (LLMs) to prioritize biologically relevant protein binding pockets using experimental literature evidence.
2.The key innovation is using LLMs to extract residue-level binding site information directly from research papers and then using it to filter and refine geometrically predicted pockets, automating a task that traditionally relies on expert manual curation.
3.The authors built a curated benchmark dataset of 10 proteins and 35 annotated papers covering diverse scenarios (no binding site, one known site, or multiple sites), so LLM performance can be evaluated in each case.
4.The LLM pipeline consists of three stages: paper filtering (relevance detection), pocket extraction (residue identification), and pocket refinement (error correction and format enforcement), all using direct prompting without complex reasoning chains.
5.Prompt optimization and a final refinement step increased Pocket Number Accuracy from 0.48 to 0.71 and Pocket Specificity from 0.46 to 0.657, while maintaining perfect Pocket Recall (1.0).
6.For each target, extracted pockets were mapped onto 3D PDB structures using a clustering algorithm that accounts for chain variations, structural inconsistencies, and multimeric interfaces—yielding spatially resolved binding sites.
7.The final volumetric representation of each pocket is computed by filtering Fpocket alpha spheres against LLM-extracted residues, converting them to a grid format, and trimming volumes using convex hulls to eliminate solvent-exposed artifacts.
8.This approach unified binding site descriptions across multiple publications, enabling more consistent identification of ligand-accessible regions in proteins such as the GABAA receptor, MLKL, the M2 receptor, and Nav1.7.
9.The benchmark revealed limitations in Fpocket’s native output (e.g., site fragmentation or over-merging), which were mitigated by the LLM-assisted filtering and merging process based on spatial residue proximity.
10.The authors provide an open-source benchmark dataset and curated markdown-formatted articles to support further development of LLM-based literature mining tools for structural biology.
11.This study showcases the growing potential of LLMs to automate literature-based knowledge extraction for practical drug discovery tasks—reducing reliance on human domain expertise in structure-based modeling workflows.
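To make point 7 more concrete, here is a minimal Python sketch of the first two volumetric steps: keeping only the alpha spheres that sit near LLM-extracted residues, then converting them to a grid. It is an illustration under stated assumptions, not the authors' implementation. The function names, the 4.0 Å cutoff, the 1.0 Å grid spacing, and the input layout (spheres as (center, radius) tuples, residue atoms as xyz tuples) are all hypothetical, and the convex-hull trimming step is left out.

```python
import math

def filter_spheres(spheres, residue_atoms, cutoff=4.0):
    """Keep alpha spheres whose surface lies within `cutoff` of any residue atom.

    `spheres` is a list of ((x, y, z), radius) tuples and `residue_atoms` is a
    list of (x, y, z) tuples. Both layouts and the cutoff are assumptions made
    for this sketch.
    """
    kept = []
    for center, radius in spheres:
        if any(math.dist(center, atom) - radius <= cutoff for atom in residue_atoms):
            kept.append((center, radius))
    return kept

def voxelize(spheres, spacing=1.0):
    """Return the set of integer grid indices whose points fall inside any sphere."""
    voxels = set()
    for center, radius in spheres:
        # Bounding box of the sphere, expressed in grid indices.
        lo = [math.floor((c - radius) / spacing) for c in center]
        hi = [math.ceil((c + radius) / spacing) for c in center]
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    point = (i * spacing, j * spacing, k * spacing)
                    if math.dist(point, center) <= radius:
                        voxels.add((i, j, k))
    return voxels

# Toy input: one sphere close to the extracted residue atom, one far away.
spheres = [((0.0, 0.0, 0.0), 1.5), ((20.0, 0.0, 0.0), 1.5)]
residue_atoms = [(3.0, 0.0, 0.0)]

kept = filter_spheres(spheres, residue_atoms)
grid = voxelize(kept)
print(len(kept), len(grid))  # 1 sphere kept, 19 grid points
```

In a fuller version, the grid points would then be tested against a convex hull before the pocket volume is reported, as the thread describes.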
💻Code:
github.com/MelnychenkoM/LLM-…
📜Paper:
biorxiv.org/content/10.1101/…
#DrugDiscovery #LLM4Bio #ProteinBindingSites #BindingPockets #Fpocket #MolecularModeling #AI4Science #StructuralBiology #LiteratureMining #PDB #Bioinformatics #MachineLearning #HybridMethods