Biominer: A multi-modal system for automated mining of protein-ligand bioactivity data from literature
1. BIOMINER targets a practical bottleneck in drug discovery: bioactivity evidence is scattered across text, tables, and figures, and ligand structures are often reported as Markush definitions that require enumeration into exact molecules (SMILES) before the data are usable.
2. The key design choice is to explicitly decouple two hard problems that end-to-end LLM extraction tends to entangle: (a) biochemical semantic interpretation of bioactivity measurements, and (b) chemically valid ligand structure construction. BIOMINER runs these in parallel and then joins them via ligand coreference identifiers.
3. For chemical structures, BIOMINER introduces Chemical-Structure-Grounded Visual Semantic Reasoning (CSG-VSR): domain-specific perception models detect/recognize chemical depictions, an MLLM reasons over indexed depictions to infer scaffold-R-group relations and coreference, and deterministic chemistry tools (OPSIN, RDKit) perform the exact symbolic construction and Markush "zipping" into full enumerated molecules.
4. The system is implemented as an agentic pipeline: document parsing (MinerU) → chemical structure agent (MolDetv2, MOLGLYPH, BIOMINER-INSTRUCT, RDKit/OPSIN) and bioactivity measurement agent (BIOMINER-INSTRUCT with post-fusion across modalities) → post-processing/integration agent that produces protein-SMILES-value triplets.
5. To make evaluation systematic, the paper releases BIOVISTA, a benchmark curated from 500 PDBbind-referenced publications: 16,457 bioactivity entries and 8,735 unique chemical structures, with modality distribution heavily table-driven (72.5%), plus substantial figure (11.6%) and text (15.8%) content; 48.7% of structures involve Markush representations.
6. On BIOVISTA, BIOMINER reaches F1 = 0.323 for complete bioactivity triplets (precision 0.319, recall 0.328). A one-shot end-to-end baseline essentially fails (F1 ≈ 0.00042), supporting the paper's argument that decomposition and tool-grounded symbolic construction are necessary for this task.
7. Component results highlight where the system is strong vs. where the field remains hard: bioactivity measurement extraction F1 = 0.626 (tables easiest; text/figures harder), ligand coreference-SMILES F1 = 0.528 (explicit structures better than Markush). Removing CSG-VSR collapses triplet F1 from 0.323 to 0.011, indicating Markush-aware structure resolution is central.
8. Error attribution suggests priorities for future work: bioactivity measurement extraction contributes 32.68% of triplet errors, OCSR 25.31%, Markush enumeration 15.91%. Chirality recognition is a major OCSR weakness (reported accuracy ~0.504 on chiral structures), and Markush recall drops notably with cross-modal R-group definitions and with three R-groups (combinatorial complexity).
9. Three applications demonstrate utility beyond benchmark scores: (a) large-scale mining from 11,683 European Journal of Medicinal Chemistry papers in ~3 days, extracting 226,076 triplets and enriching 82,262 with protein structures; pretraining GNN affinity models on this noisy-but-large dataset improves downstream RMSE by ~3.9% (and outperforms unsupervised or label-shuffled controls). (b) A human-in-the-loop workflow curates 1,592 high-quality NLRP3 data points from 85 papers in 26 hours (doubling ChEMBL's NLRP3 set), improving QSAR early enrichment (average EF1% 38.6% over 28 model settings) and yielding 16 virtual-screening hit candidates with novel scaffolds. (c) Structure-bioactivity annotation on PoseBusters: HITL improves accuracy from 90.5% to 96.25% and reduces annotation time from 195.8 s to 35.0 s per entry (5.59x faster).
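To make the join in point 2 concrete, here is a minimal sketch of how two independently produced outputs could be merged on ligand coreference identifiers. It is not BIOMINER's code: the `Measurement` class, the `join_on_coreference` function, the placeholder protein name, and all values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One bioactivity record from the (hypothetical) measurement branch."""
    protein: str
    ligand_id: str  # coreference label as printed in the paper, e.g. "1a"
    assay: str      # e.g. "IC50"
    value: float
    unit: str


def join_on_coreference(measurements, structures):
    """Join measurements to SMILES by ligand id.

    `structures` maps ligand id -> SMILES from the (hypothetical) structure
    branch. Returns (triplets, unresolved_ids); a measurement whose ligand
    has no resolved structure is reported rather than silently dropped.
    """
    triplets, unresolved = [], []
    for m in measurements:
        smiles = structures.get(m.ligand_id)
        if smiles is None:
            unresolved.append(m.ligand_id)
            continue
        triplets.append((m.protein, smiles, f"{m.assay}={m.value} {m.unit}"))
    return triplets, unresolved


# Toy inputs, invented for illustration only.
measurements = [
    Measurement("PROT1", "1a", "IC50", 12.0, "nM"),
    Measurement("PROT1", "1b", "IC50", 250.0, "nM"),
    Measurement("PROT1", "2", "IC50", 3.1, "uM"),
]
structures = {"1a": "Cc1ccccc1", "1b": "Clc1ccccc1"}

triplets, unresolved = join_on_coreference(measurements, structures)
for t in triplets:
    print(t)
print("unresolved ligand ids:", unresolved)
```

Because the two branches only meet at the identifier, a failure in either one shows up as an unresolved id or a missing measurement, which is roughly what makes the per-component error attribution in point 8 possible.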
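For the triplet scores in point 6, a small sketch of set-based precision, recall, and F1 may help. It assumes exact matching of (protein, SMILES, value) tuples, which is a simplification: the benchmark's actual matching rules (for example SMILES canonicalization or value tolerance) are not described in this thread. The function names and toy triplets are hypothetical.

```python
def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of triplets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy triplets, invented for illustration only.
gold = {("P", "CCO", "1 nM"), ("P", "CCN", "2 nM"),
        ("P", "CCC", "3 nM"), ("P", "CCCl", "4 nM")}
predicted = {("P", "CCO", "1 nM"), ("P", "CCN", "2 nM"), ("P", "CCBr", "9 nM")}

precision, recall, f1 = prf1(predicted, gold)
print(f"toy example: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")

# Consistency check on the reported numbers: P = 0.319, R = 0.328.
reported_f1 = f1_from(0.319, 0.328)
print(f"F1 from reported P/R: {reported_f1:.3f}")
```

The reported precision and recall are consistent with the reported F1 of 0.323 under this definition.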
Code:
github.com/jiaxianyan/BioMin…
Paper:
arxiv.org/abs/2604.21508
#ComputationalBiology #DrugDiscovery #Bioinformatics #TextMining #MultimodalAI #LLM #Chemoinformatics #OCSR #Markush #QSAR #Dataset #Benchmarking