BOLEK: A Multimodal Language Model for Molecular Reasoning
1. BOLEK targets a key pain point in molecular ML: predictions that are either opaque (just a score) or come with “explanations” that can’t be checked against the actual molecule. The model is designed so its natural-language reasoning can be audited using verifiable molecular features.
2. Core idea: inject a molecular embedding directly into an instruction-tuned text decoder. BOLEK extends Qwen3-4B-Instruct with a single Morgan fingerprint token (2048-bit, radius=2) mapped into the LLM embedding space via a small learned projector, then trains everything end-to-end.
3. The alignment recipe is deliberately “first-principles” rather than caption-style. Instead of mostly mapping molecules to descriptive prose, BOLEK is trained to answer many concrete questions about the molecule: (a) free-text structural/property descriptions, (b) substructure presence/absence, and (c) numeric descriptor prediction.
4. Alignment scale and coverage: >850k molecules drawn from MolPILE (700k), KnowMol (90k), ChEBI-20-MM (26k), plus naming sets. Tasks include regression of 88 RDKit/Mordred descriptors and detection of 1,403 substructures spanning MACCS keys, RDKit fragments, SMARTS catalogs, and toxicophore/alert lists.
5. Downstream training uses 15 TDC binary classification endpoints (e.g., AMES, BBB, HIA, hERG, Pgp, HIV, and multiple CYP inhibition/substrate tasks). BOLEK is trained in a single supervised fine-tuning run that mixes alignment and downstream examples, in both yes/no and chain-of-thought (CoT) formats.
6. A notable ingredient: CoT supervision is synthetic but feature-anchored. For each training molecule, the rationale prompt includes (i) a literature-derived mechanistic preamble for the endpoint, (ii) SMILES, (iii) a decomposition into named parts with local annotations, and (iv) values of the top 20 RDKit descriptors chosen by random-forest feature importance. The CoT is generated and filtered to match the ground-truth label.
7. Predictive results on the 15 TDC tasks: BOLEK improves over its Qwen3-4B-Instruct base on all 15 tasks in yes/no mode, and on 13/15 in CoT mode. Mean ROC/PR AUC rises from 0.55 (Qwen3) to 0.76 (BOLEK) in yes/no mode. Despite being less than half the size (4B vs. 9B parameters), BOLEK outperforms TxGemma-9B-Chat on 13/15 binary tasks.
8. Groundedness evaluation is treated as a first-class metric: BOLEK mentions concrete numerical descriptors 10–100× more often per CoT than Qwen3, TxGemma, or GPT-5.4. When it cites values, they align well with RDKit for canonical features (e.g., TPSA, MolLogP, MolWt; Spearman ρ ≈ 0.87–0.91), highlighting “auditable” rationales rather than purely qualitative prose.
9. Representation ablation (fingerprint vs SMILES) shows complementarity: fingerprint input wins on many enzyme/transporter tasks driven by substructure/shape (Veith CYPs, CYP substrates, Pgp, bioavailability), while SMILES can be stronger on tasks with broader token-level cues or multi-mechanism signals (e.g., HIV) and on some permeability/tox endpoints. This supports the paper’s view that no single representation dominates across endpoint families.
10. Generalization beyond the trained endpoints: on 15 unseen TDC classification tasks, BOLEK improves over Qwen3 zero-shot and matches/exceeds TxGemma on several non-Tox21 endpoints (e.g., PAMPA, ClinTox, skin reaction, SARS-CoV-2 Touret, M1 antagonist). On 3 held-out regression tasks (lipophilicity, PPBR, solubility), BOLEK shows non-trivial rank correlations despite never being trained on downstream regression.
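
To make point 6 concrete, here is a minimal sketch of how the four rationale-prompt ingredients might be assembled and how a generated CoT could be filtered against the ground-truth label. The thread only names the ingredients and the label filter; the class and function names, the prompt layout, the "Answer: yes/no" convention, and the toy molecule values below are all assumptions made for illustration, not the paper's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class RationaleInputs:
    endpoint_preamble: str  # (i) literature-derived mechanistic preamble
    smiles: str             # (ii) SMILES string
    parts: list             # (iii) (part name, local annotation) pairs
    descriptors: dict       # (iv) descriptor name -> value, most important first
    label: bool             # ground-truth class for the endpoint


def build_rationale_prompt(x, max_descriptors=20):
    """Assemble the four ingredients into one prompt (layout is hypothetical)."""
    part_lines = [f"- {name}: {note}" for name, note in x.parts]
    desc_items = list(x.descriptors.items())[:max_descriptors]
    desc_lines = [f"- {name} = {value:.2f}" for name, value in desc_items]
    return "\n".join(
        ["Background:", x.endpoint_preamble, "",
         f"SMILES: {x.smiles}", "",
         "Parts:", *part_lines, "",
         "Descriptors:", *desc_lines, "",
         "Explain step by step, then finish with 'Answer: yes' or 'Answer: no'."]
    )


def final_answer(cot_text):
    """Parse a trailing 'Answer: yes/no' line; return None if it is missing."""
    lines = cot_text.strip().splitlines()
    if not lines:
        return None
    last = lines[-1].strip().lower()
    if last == "answer: yes":
        return True
    if last == "answer: no":
        return False
    return None


def keep_rationale(cot_text, label):
    """Keep a generated rationale only if its final answer matches the label."""
    return final_answer(cot_text) == label


# Toy input: the annotations, descriptor values, and label are placeholders.
example = RationaleInputs(
    endpoint_preamble="(endpoint-specific mechanistic summary goes here)",
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    parts=[("acetyl ester", "attached to the aromatic ring"),
           ("carboxylic acid", "ortho to the ester")],
    descriptors={"TPSA": 63.6, "MolLogP": 1.31, "MolWt": 180.16},
    label=True,  # toy label, not taken from TDC
)
prompt = build_rationale_prompt(example)
```

A rationale ending in "Answer: no" would be dropped for this toy example, since `keep_rationale` compares the parsed answer with `example.label`.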
📜Paper:
arxiv.org/abs/2605.02745
#ComputationalBiology #Cheminformatics #DrugDiscovery #MultimodalAI #LLM #QSAR #ExplainableAI #MolecularML #TDC #RDKit