Generative pretraining for drug molecule design with bidirectional structure-property optimization
1. The paper presents BiSP-GP, a single pretrained framework that supports both controllable molecule generation (properties and/or scaffolds as conditions) and SMILES-to-property prediction, using one unified autoregressive sequence modeling setup rather than separate task-specific models.
2. A key idea is to turn continuous properties into “language”: QED, LogP, and SAS are serialized into semantic token sequences (property identifier, sign, digits, decimal point, and digit position tokens). This keeps numerical precision while letting properties be modeled in the same token space as SMILES, avoiding the usual “properties as plain numeric constraints” design.
3. Architecture: dual Transformer encoders (structure encoder for SMILES/scaffolds; property encoder for property-token sequences) plus a cross-modal decoder with cross-attention. The decoder enables bidirectional mapping: (a) generate SMILES conditioned on properties/scaffolds, and (b) generate property tokens conditioned on SMILES.
4. Pretraining uses five self-supervised objectives: SMILES reconstruction, property reconstruction, cross-modal and intra-modal contrastive learning, conditional SMILES generation, and SMILES-conditioned property generation. The contrastive part includes a soft-label strategy (via a momentum encoder) so that structurally similar molecules with similar properties are not treated as hard negatives.
5. Robustness mechanism: stochastic masking of conditions. With 50% probability, an entire property’s tokens are replaced by [UNK], exposing the model to missing/incomplete property settings and enabling flexible inference-time control (choose which properties to constrain by providing tokens; leave others as [UNK]).
6. Unconditional generation (1,000 samples) is compared to CharRNN, LatentGAN, MolGPT, SPMM, and GP-MoLFormer. BiSP-GP reports the best composite V*U*N*I score (0.804) with strong validity (0.986), near-perfect uniqueness (0.999), high novelty (0.926), and high internal diversity (0.882), aiming for a better novelty–diversity balance than several baselines.
7. Single-property conditional generation (targets across QED, LogP, SAS) is evaluated with mean absolute deviation (MAD) for control accuracy plus Moses quality metrics. BiSP-GP shows the lowest MAD across all three properties versus CMGN, Scaffold-GGM, and SPMM, while maintaining strong uniqueness and internal diversity under constraints.
8. Multi-property control is tested for QED-LogP, QED-SAS, LogP-SAS, and QED-LogP-SAS conditions. The model maintains validity/uniqueness/novelty > 0.9 across scenarios and produces property distributions clustered around targets, while leaving unconstrained properties broadly distributed, which is useful for realistic multi-objective optimization.
9. Scaffold-conditioned and scaffold-plus-property generation: on 100 unseen scaffolds, BiSP-GP keeps the scaffold similarity ratio (Sim_ratio) > 0.8 while generating novel variants; similarity analyses suggest novelty comes from both out-of-distribution scaffolds and side-chain diversification. Under joint scaffold and multi-property constraints, the scaffold structure is still preserved and property values concentrate near the targets.
10. Practical case study: PAK1 inhibitor optimization. With a fixed scaffold and a reduced LogP target (from 4.70 down toward 2.50 while holding QED and SAS fixed), generated candidates show better average docking scores against PAK1 (PDB: 4EQC), about 0.35 kcal/mol better than the reference, and introduce additional polar interactions while retaining a key H-bond with GLU-315.
11. Property prediction as sequence generation: on 1,000 unseen molecules, BiSP-GP generates grammatically valid property strings and achieves very high agreement with RDKit-computed values (R²: LogP 0.999, QED 0.997, SAS 0.987). It remains reliable on randomized SMILES, suggesting learned structure–property relationships are not brittle to SMILES syntax variation.
12. Transfer learning: using the pretrained structure encoder as a frozen feature extractor plus a lightweight head, BiSP-GP performs strongly on MoleculeNet tasks plus Malaria and CEP, with statistically supported gains over several baselines on many regression/classification datasets; y-scrambling checks indicate performance is not driven by label artifacts.
13. Ablations indicate both innovations matter: replacing property serialization with numeric embeddings degrades conditional control (notably LogP MAD) and lowers property-prediction R²; removing contrastive learning broadly reduces generation quality, controllability, and prediction accuracy, supporting the role of cross-modal alignment.
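To make the property serialization in point 2 concrete, here is a minimal sketch of turning a value into identifier, sign, digit, decimal-point, and digit-position tokens, and decoding it back (as needed for the property prediction in point 11). The token spellings (`[LogP]`, `[+]`, `[P0]`, `[.]`), the function names, and the fixed number of decimals are assumptions for illustration; the paper's actual vocabulary may differ.

```python
# Hypothetical token vocabulary: the real BiSP-GP tokens are not given in this summary.
def serialize_property(name, value, decimals=3):
    """Encode one property as identifier, sign, digit, point and position tokens."""
    tokens = [f"[{name}]", "[+]" if value >= 0 else "[-]"]
    int_part, frac_part = f"{abs(value):.{decimals}f}".split(".")
    # Integer digits get non-negative positions (units = 0, tens = 1, ...).
    for i, digit in enumerate(int_part):
        tokens += [digit, f"[P{len(int_part) - 1 - i}]"]
    tokens.append("[.]")
    # Fractional digits get negative positions (tenths = -1, hundredths = -2, ...).
    for i, digit in enumerate(frac_part):
        tokens += [digit, f"[P{-(i + 1)}]"]
    return tokens


def deserialize_property(tokens):
    """Decode a token sequence produced by serialize_property back to (name, value)."""
    name = tokens[0].strip("[]")
    sign = -1.0 if tokens[1] == "[-]" else 1.0
    # Keep digits and the decimal point; position tokens are dropped.
    text = "".join(t if t.isdigit() else "." if t == "[.]" else "" for t in tokens[2:])
    return name, sign * float(text)


if __name__ == "__main__":
    toks = serialize_property("LogP", 4.70, decimals=2)
    print(toks)
    print(deserialize_property(toks))
```

Because every digit is its own token, the value keeps its precision up to the chosen number of decimals, and the same decoder that emits SMILES characters can emit property strings.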
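The condition masking in point 5 can be sketched in a few lines. This is an illustration, not the authors' code: it assumes each property is masked independently with probability 0.5 and that every token of a masked property becomes one `[UNK]` (so the sequence length is unchanged). The function names and the dictionary layout are hypothetical.

```python
import random

UNK = "[UNK]"


def mask_conditions(prop_tokens, p_mask=0.5, rng=None):
    """Training-time masking: replace all tokens of a property with [UNK] with probability p_mask.

    prop_tokens maps a property name to its token list (assumed layout).
    """
    rng = rng or random.Random()
    masked = {}
    for name, tokens in prop_tokens.items():
        if rng.random() < p_mask:
            masked[name] = [UNK] * len(tokens)  # whole property hidden
        else:
            masked[name] = list(tokens)  # property left visible
    return masked


def build_inference_condition(prop_tokens, constrained):
    """Inference-time control: keep tokens only for the properties the user wants to constrain."""
    return {
        name: list(tokens) if name in constrained else [UNK] * len(tokens)
        for name, tokens in prop_tokens.items()
    }


if __name__ == "__main__":
    example = {"QED": ["0", ".", "8"], "LogP": ["2", ".", "5"], "SAS": ["3", ".", "0"]}
    print(mask_conditions(example, rng=random.Random(0)))
    print(build_inference_condition(example, constrained={"LogP"}))
```

Training on partially masked conditions is what lets the same model accept any subset of properties at inference time: constrained properties are passed as tokens and the rest stay `[UNK]`.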
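The headline metrics in points 6 and 7 are simple to compute. The sketch below assumes the composite V*U*N*I score is the plain product of validity, uniqueness, novelty, and internal diversity, and that MAD is the mean absolute difference between each generated molecule's property value and the target; both readings are assumptions based on the names, and the function names are hypothetical.

```python
from math import prod


def composite_score(validity, uniqueness, novelty, int_div):
    """Assumed V*U*N*I composite: the product of the four quality metrics."""
    return prod([validity, uniqueness, novelty, int_div])


def mean_absolute_deviation(generated_values, target):
    """Assumed MAD: average |generated property value - target| over the samples."""
    return sum(abs(v - target) for v in generated_values) / len(generated_values)


if __name__ == "__main__":
    # Values reported for BiSP-GP in unconditional generation.
    print(round(composite_score(0.986, 0.999, 0.926, 0.882), 3))
    # Made-up LogP values for three generated molecules with a target of 2.5.
    print(mean_absolute_deviation([2.4, 2.6, 2.5], target=2.5))
```

Under the product assumption, the four reported metrics multiply to about 0.804, which matches the composite score quoted in point 6.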
💻Code: github.com/xmubiocode/BiSP-G… (Zenodo: zenodo.org/records/20115955)
📜Paper: doi.org/10.1038/s42004-026-0…
#ComputationalChemistry #Cheminformatics #MolecularGeneration #DrugDiscovery #Transformers #FoundationModels #GenerativeAI #PropertyPrediction #ScaffoldHopping #RepresentationLearning