Biology AI Daily

Biology AI Daily

Users
Tweets

Jun 13

An electron-density point-cloud framework for robust protein-ligand interaction prediction 1. E-CloudBind reframes protein–ligand affinity prediction around electron-density point clouds rather than relying on sub-ångström atomic coordinates, aiming to stay accurate when structures are low-resolution or predicted (e.g., AlphaFold) and therefore noisy. 2. Key idea: replace hard distance cutoffs for “contacts” (e.g., within 5 Å) with density-aware Gaussian “electron clouds”, where interactions are defined by overlap/isosurface intersection, yielding a more resolution-agnostic interaction graph. 3. The framework explicitly splits chemistry into two complementary channels: non-covalent interactions from 3D electron-cloud point clouds, and covalent structure from intrinsic molecular graphs (bond topology), then fuses them for affinity regression. 4. Ligand electron density is obtained via semi-empirical quantum chemistry (GFN2-xTB), while protein pockets use a van der Waals radius-guided multivariate Gaussian sampling strategy as a physically motivated proxy that is far cheaper than full QM density. 5. Architecture highlights: K-means clusters point clouds into atom-aligned local regions; a point-cloud encoder (3D-GCN-style deformable kernels) learns local non-covalent patterns (e.g., H-bonds, π-stacking, van der Waals complementarity); a heterogeneous GNN encodes covalent graphs; a multi-bond fusion module integrates both. 6. On PDBbind 10-fold CV, E-CloudBind reports MAE 1.059 and Pearson 0.667, outperforming representative sequence-based (PSICHIC), graph-based (SIGN), and structure-based (DMFF) baselines, and also comparing favorably to recent structure-centric methods (EHIGN, Boltz-2, FlowDock) under the same protocol. 7. Robustness to experimental resolution: when regressing absolute error vs. crystallographic resolution, E-CloudBind shows a much flatter slope (0.017) than baselines (0.053–0.065), with a non-significant trend (p = 0.703), indicating reduced sensitivity to declining structural quality. 8. Robustness to structure source shifts: swapping experimental proteins with AlphaFold2 models causes only a small performance change for E-CloudBind (MAE 0.042; Pearson −0.004), while coordinate-dependent baselines degrade more (e.g., DMFF MAE 0.187; Pearson −0.093). 9. Out-of-distribution testing built from DAVIS via combinatorial partitioning by protein/ligand complexity shows tighter error dispersion for E-CloudBind (lowest median deviation), with stable performance across increasing protein Relative Contact Order and ligand Bertz complexity. 10. Practical and interpretability results: attention maps highlight polar ligand atoms and key pocket regions consistent with known interaction motifs; large-scale screening on 80,383 ZINC molecules against PBP1A, SARS-CoV-2 Mpro, and BCL-2 uses docking for follow-up, plus BCL-2 candidate assessment with synthesizability metrics and explicit-solvent MD (400 ns) suggesting stable binding for selected hits. 💻Code: github.com/Liuyujian0408/DPI ; doi.org/10.5281/zenodo.19851… 📜Paper: doi.org/10.1038/s41467-026-7… #ComputationalBiology #DrugDiscovery #ProteinLigand #BindingAffinity #GeometricDeepLearning #GNN #PointCloud #ElectronDensity #VirtualScreening #AlphaFold

1,424

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Jun 7

When Does Structure Help? The Information Bonus of AlphaFold2 Representations over Protein Language Models 1. The paper introduces Information Bonus (IB): a task-level metric that quantifies how much linearly accessible signal is gained by using frozen AlphaFold2 (AF2) Evoformer representations instead of a cheaper frozen sequence-only model (ESM-2), evaluated under protein-level cross-validation. 2. IB is defined as the held-out performance difference between the best AF2 representation (chosen post-hoc between Evoformer single vs pair-diagonal) and ESM-2, using the same frozen linear probe. IB > 0 means structure adds usable signal; IB < 0 means sequence embeddings are sufficient or better. 3. The most decisive positive-IB regime is allostery (AlloSigDB; 47 proteins, 9,925 residues, 4.8% positives): AF2 single achieves AUROC 0.548, while ESM-2 is below chance at 0.485 and AF2 pair-diagonal is near chance at 0.497. This suggests AF2 single encodes long-range geometric/communication-network information that is not linearly recovered from sequence alone. 4. Binding affinity (PDBbind; n=5,680 complexes) shows a strong negative IB: ESM-2 reaches Pearson r=0.449 vs AF2 single r=0.307 and AF2 pair-diagonal r=0.278 (IB = -0.141). The paper argues this likely reflects evolutionary/family-level binding constraints captured by sequence models. 5. A key experimental design choice: the affinity probe receives only protein features (no ligand representation). So the benchmark tests whether representations capture protein-level correlates of affinity (e.g., pocket druggability, family propensity), not ligand-specific complementarity; AF2 features also reflect an apo-like inference rather than the bound complex. 6. Flexibility (ATLAS MD; 268 proteins, 50,426 residues) is mixed and label-dependent. For RMSF regression, AF2 pair-diagonal is directionally best (r=0.436) vs ESM-2 (r=0.407), giving a small positive IB ( 0.030) with limited statistical power across 5 folds. 7. For within-protein median flexibility classification, ESM-2 wins clearly: AUROC 0.824 vs AF2 pair-diagonal 0.764 and AF2 single 0.762 (IB = -0.060; p=0.0017 vs AF2 pair). Interpretation: sequence context captures disorder/mobility signatures better than static geometry for this relative-flexibility label. 8. The paper highlights a residue-level leakage artifact: naive residue-wise KFold (allowing residues from the same protein in both train/test) inflates RMSF performance by 27–39% depending on representation (e.g., ESM-2 r=0.672 under leaky split vs 0.407 under protein-level GroupKFold). This inflation can reverse representation rankings and change conclusions. 9. Practical takeaway framed for AI-scientist workflows: representation choice should be a measurable decision. Start with ESM-2 when labels are plausibly driven by evolutionary constraints or disorder-like sequence signatures; pay the AF2 inference cost when the mechanism depends on long-range 3D communication (as in allostery). When uncertain, estimate IB on a small labeled set before scaling structural inference. 📜Paper: arxiv.org/abs/2606.04228 #ComputationalBiology #ProteinML #AlphaFold2 #ProteinLanguageModels #RepresentationLearning #Allostery #Benchmarking #DataLeakage #AIFORScience

2,273

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Jun 3

HonestAffinity: Leak-Aware Evaluation of Protein and Pocket Priors for Binding Affinity Prediction 1. HonestAffinity frames a key caution for protein–ligand affinity models: architectural “priors” can flip from helpful to harmful depending on whether evaluation splits leak protein/ligand similarity (canonical CASF/PDBbind-style) or are leak-proof (LP-PDBBind 3-tier no-leak). 2. The paper isolates two common priors in a controlled 1D-input setting: frozen ESM-2 (650M) per-residue embeddings (1280-d projected to 256) and a learned binary pocket-position marker added to residue features when pocket annotations are available. 3. Core result is a split-conditioned reversal across both priors. On familiar/canonical splits (val, CASF-2016, CASF-2016 non-train), adding ESM-2 and the pocket marker improves performance; on strict LP no-leak tiers (test_cl1–cl3), the same additions reduce Pearson R. 4. Three deployment-matched variants are proposed rather than one “best” model: HONESTAFFINITY-POCKET (ESM-2 pocket marker) for familiar/annotated targets; HONESTAFFINITY-NOPOCKET (ESM-2 only) when no pocket list exists; HONESTAFFINITY-POCKET-NOESM (21-token residue embedding pocket marker) for strict LP-style generalization with pocket annotations. 5. Quantitatively (Pearson R, mean±std over 3 seeds): HONESTAFFINITY-POCKET leads on val (0.548), CASF-2016 (0.747), and CASF non-train (0.646). But on LP strict tiers, HONESTAFFINITY-POCKET-NOESM leads: cl1 0.531, cl2 0.538, cl3 0.497, also giving best RMSE on cl2/cl3. 6. The ESM-2 ablation is especially instructive: swapping ESM-2 for a learned 21-vocab residue embedding decreases R on val/CASF (e.g., CASF-2016 drops from 0.747 to 0.713) but increases R on every strict LP tier (e.g., cl3 rises from 0.433 to 0.497; ∆R up to 0.064). 7. The pocket marker shows the same sign flip: it helps on val/CASF but hurts on LP tiers. Interpretation: both priors inject signals correlated with training-distribution structure (protein-family signatures or pocket geometry), which can become misleading when similarity filtering removes overlap by design. 8. Methodologically, the authors argue for paired reporting: canonical metrics alone would over-credit ESM/pocket priors; leak-proof metrics alone would understate their utility for in-distribution scoring. They recommend routine paired canonical leak-proof ablations for new affinity predictors. 9. Implementation is intentionally compact and scalable: multi-scale 1D CNNs a residual block a single Transformer layer per branch (protein and SMILES), coupled by a matrix-product compatibility map. Training uses 11,513 LP-PDBBind train complexes and runs in ~3 GPU-hours on a single V100; inference is ~10 ms/complex with cached embeddings. 📜Paper: arxiv.org/abs/2606.03422 #CompBio #Bioinformatics #DrugDiscovery #ProteinLanguageModels #ESM2 #BindingAffinity #Benchmarking #PDBbind #CASF2016 #MachineLearning

1,588

Biology AI Daily

Biology AI Daily @BiologyAIDaily

May 24

Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding 1. HCLBind reframes protein–ligand affinity prediction for multi-domain targets as a hierarchical geometry problem: instead of treating proteins as monolithic rigid graphs, it explicitly learns both local pocket rules and global inter-domain conformational validity, which is critical when domain motions control access and recognition. 2. The key idea is to decouple representation learning from affinity regression via self-supervised pre-training on Q-BioLiP, then fine-tune on PDBbind. This reduces reliance on scarce labeled affinity data and makes the learned features more robust to flexible regions and misleading geometries. 3. A novel hierarchical decoy strategy provides structure-aware supervision at two levels: for single-domain proteins it perturbs coordinates with Gaussian noise (σ=1.5 Å) to teach local physicochemical/geometric constraints; for multi-domain complexes it rotates one domain (15°/30°) to create interface-invalid conformations that test global quaternary geometry. 4. Pre-training combines two complementary objectives: Interface Decoy Discrimination (IDD) as a binary task to separate native vs perturbed interfaces (learning “is this interface physically valid?”), and Ligand–Protein Matching (LPM) with an InfoNCE batch contrastive loss to learn physicochemical compatibility between proteins and ligands. 5. Architecturally, HCLBind uses a dual-branch multimodal encoder: ESMC for protein sequence embeddings and MolFormer for ligand SMILES embeddings, then a domain-gated graph attention network for protein structure where a learnable interface bias prioritizes inter-domain edges while masking non-neighbors. 6. Cross-modal fusion is performed with multi-head cross-attention where ligand tokens query protein structural tokens, approximating ligand “surface scanning” and encouraging the model to focus on discriminative interface features rather than diffuse global signals. 7. Parameter-efficient adaptation is done with LoRA on the protein and ligand foundation model backbones to preserve evolutionary/chemical knowledge while still adapting to binding tasks under limited supervision; ablations show removing LoRA substantially degrades accuracy. 8. For fine-tuning, HCLBind uses Evidential Deep Learning (EDL) with a Normal-Inverse-Gamma output to model both epistemic and aleatoric uncertainty, targeting the known issue that flexible linker regions inject noise and can cause overconfident predictions in deterministic regressors. 9. On a strict time-based PDBbind split (test complexes released ≥2019), HCLBind reports RMSE 1.309, PCC 0.698, C-index 0.744, outperforming multiple baselines including a strong contrastive baseline (CL-GNN). Ablations indicate IDD and LPM are both necessary, and EDL improves both performance and reliability. 10. Reliability analyses show uncertainty is actionable: rejecting high-uncertainty predictions monotonically reduces RMSE, and the model assigns significantly higher epistemic uncertainty to structurally disrupted interface decoys than to native complexes, suggesting it internalizes geometric validity signals from IDD. 💻Code: github.com/jiankliu/HCLBind 📜Paper: arxiv.org/abs/2605.19902 #ComputationalBiology #GeometricDeepLearning #DrugDiscovery #ProteinLigand #GNN #ContrastiveLearning #UncertaintyEstimation #ProteinDesign #MultimodalAI #MachineLearning

1,708

Biology AI Daily

Biology AI Daily @BiologyAIDaily

May 16

PythiaStudio: A one-stop protein engineering platform powered by Pythia model suite 1. PythiaStudio is a unified, high-throughput web platform for protein engineering that brings mutation effect prediction (stability and PPI affinity), pocket annotation, fitness scoring, and structure prediction into one workflow aimed at practical design decisions rather than isolated analyses. 2. The core capability is rapid, full saturation mutagenesis screening: for a given protein structure, the server can return complete mutation landscapes (all single substitutions across all positions) within minutes, presented as interactive heatmaps and sortable tables for prioritization. 3. The platform integrates three deep learning structure-based models from the Pythia family: Pythia for folding stability changes (ΔΔGfolding), Pythia-PPI for mutation-induced binding affinity changes (ΔΔGbind) at protein–protein interfaces, and Pythia-Pocket for identifying catalytic/ligand-binding pocket residues to help avoid functionally critical sites during stabilization. 4. Pythia’s modeling choice is notable: it is self-supervised on protein structures (CATH and PDB) and estimates mutation energies from learned amino-acid probabilities, helping reduce reliance on labeled ΔΔG datasets and avoiding potential benchmark leakage; it also emphasizes physical consistency via forward/reverse mutation antisymmetry metrics. 5. Pythia-PPI extends the same structural graph encoder with transfer learning multitask learning (binding and stability) and uses self-distillation to expand training signals (from ~4k to ~400k interface-related samples derived from SKEMPI v2), achieving strong cross-validated performance reported in the paper. 6. Pythia-Pocket is trained on PDBBind and outputs per-residue pocket likelihoods, enabling a common engineering tactic: exclude predicted catalytic/substrate-binding residues while searching for stabilizing mutations to reduce the risk of activity loss. 7. Sequence-based modules complement structure-based predictions: ESM-2 provides zero-shot evolutionary fitness scores for mutations; ESMFold generates structures from sequences when experimental structures are unavailable; an indel module (ProGen2-medium) scores single-residue deletions and alanine insertions; aggregation propensity analysis is provided via Aggrescan3D 2.0. 8. A key design concept is the integrated two-step engineering workflow to manage stability–activity trade-offs: first use Pythia to propose stabilizing mutations (recommended starting cutoff around ΔΔG < −3.0, excluding pocket residues), experimentally validate and accumulate beneficial ones; then use ESM-2 to propose activity/fitness-improving mutations on the stabilized background (recommended starting cutoff around fitness score < −1.5). 9. The paper highlights real engineering outcomes used to validate the workflow (glycoside hydrolases): Bacillus subtilis cellulase and β-glucanase were redesigned via greedy accumulation of stabilizing mutations followed by fitness-guided activity tuning, yielding variants with improved thermostability while maintaining or improving catalytic activity; similar strategy was reported to generalize to amidase, galactosidase, and transaminase projects. 10. Implementation details emphasize accessibility and reproducibility: a three-tier web architecture (React/TypeScript UI with Mol* visualization; FastAPI backend; GPU-accelerated services; job IDs with shareable URLs), no login requirement, and explicit notes that prediction reliability depends on input structure quality (recommending high-confidence experimental or AlphaFold/ESMFold models). 💻Code: github.com/Wublab/pythia ; github.com/Wublab/pythia_ppi 📜Paper: doi.org/10.1093/nar/gkag408 #ProteinEngineering #ComputationalBiology #DeepLearning #Bioinformatics #EnzymeEngineering #ProteinDesign #WebServer #GNN #ProteinStability

2,868

Sylvain Gariel

Sylvain Gariel

@SylvainGariel

May 6

Big congrats to the OpenBind team on this first release! 🚀 ~900 novel, high-quality crystallographic protein-ligand structures (most with paired experimental affinities) on a single target: EV-A71 2A protease. For context, PDBbind, the main public resource after decades of collection, has ~20k such complexes spread across thousands of proteins, meaning most targets have only a handful to low dozens. This adds rare depth on one protein. Affinity prediction remains largely unsolved in structure-based drug design. Top AI models still struggle with accurate Kd/IC50 values, largely (but not only) due to limited clean experimental structure-affinity pairs. This proves their pipeline works. With a roadmap to >500k structures and affinities in ~5 years, we’ll get to the scale needed to transform docking, cofolding, and binding prediction. Significant win for AI-driven discovery. The community will put this to work right away. Full details data: openbind.uk/news/blog-openbi…

Charlie Harris @charlieharris01

May 6

NEW: today OpenBind ‘comes out of stealth’ so to speak with their first data dump of ~900 novel protein-ligand structures - most with paired affinities This represents a meaningful %-age increase in all of humanities P-L data in the PDB collected in the last 50 years More👇

3,857

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Apr 25

Biominer: A multi-modal system for automated mining of protein-ligand bioactivity data from literature 1. BIOMINER targets a practical bottleneck in drug discovery: bioactivity evidence is scattered across text, tables, and figures, and ligand structures are often reported as Markush definitions that require enumeration into exact molecules (SMILES) before the data are usable. 2. The key design choice is to explicitly decouple two hard problems that end-to-end LLM extraction tends to entangle: (a) biochemical semantic interpretation of bioactivity measurements, and (b) chemically valid ligand structure construction. BIOMINER runs these in parallel and then joins them via ligand coreference identifiers. 3. For chemical structures, BIOMINER introduces Chemical-Structure-Grounded Visual Semantic Reasoning (CSG-VSR): domain-specific perception models detect/recognize chemical depictions, an MLLM reasons over indexed depictions to infer scaffold–R-group relations and coreference, and deterministic chemistry tools (OPSIN, RDKit) perform the exact symbolic construction and Markush “zipping” into full enumerated molecules. 4. The system is implemented as an agentic pipeline: document parsing (MinerU) → chemical structure agent (MolDetv2 MOLGLYPH BIOMINER-INSTRUCT RDKit/OPSIN) and bioactivity measurement agent (BIOMINER-INSTRUCT with post-fusion across modalities) → post-processing/integration agent that produces protein–SMILES–value triplets. 5. To make evaluation systematic, the paper releases BIOVISTA, a benchmark curated from 500 PDBbind-referenced publications: 16,457 bioactivity entries and 8,735 unique chemical structures, with modality distribution heavily table-driven (72.5%), plus substantial figure (11.6%) and text (15.8%) content; 48.7% of structures involve Markush representations. 6. On BIOVISTA, BIOMINER reaches F1 = 0.323 for complete bioactivity triplets (precision 0.319, recall 0.328). A one-shot end-to-end baseline essentially fails (F1 ≈ 0.00042), supporting the paper’s argument that decomposition and tool-grounded symbolic construction are necessary for this task. 7. Component results highlight where the system is strong vs. where the field remains hard: bioactivity measurement extraction F1 = 0.626 (tables easiest; text/figures harder), ligand coreference-SMILES F1 = 0.528 (explicit structures better than Markush). Removing CSG-VSR collapses triplet F1 from 0.323 to 0.011, indicating Markush-aware structure resolution is central. 8. Error attribution suggests priorities for future work: bioactivity measurement extraction contributes 32.68% of triplet errors, OCSR 25.31%, Markush enumeration 15.91%. Chirality recognition is a major OCSR weakness (reported accuracy ~0.504 on chiral structures), and Markush recall drops notably with cross-modal R-group definitions and with three R-groups (combinatorial complexity). 9. Three applications demonstrate utility beyond benchmark scores: (a) large-scale mining from 11,683 European Journal of Medicinal Chemistry papers in ~3 days, extracting 226,076 triplets and enriching 82,262 with protein structures; pretraining GNN affinity models on this noisy-but-large dataset improves downstream RMSE by ~3.9% (and outperforms unsupervised or label-shuffled controls). (b) A human-in-the-loop workflow curates 1,592 high-quality NLRP3 data points from 85 papers in 26 hours (doubling ChEMBL’s NLRP3 set), improving QSAR early enrichment (average EF1% 38.6% over 28 model settings) and yielding 16 virtual-screening hit candidates with novel scaffolds. (c) Structure–bioactivity annotation on PoseBusters: HITL improves accuracy from 90.5% to 96.25% and reduces annotation time from 195.8 s to 35.0 s per entry (5.59x faster). 💻Code: github.com/jiaxianyan/BioMin… 📜Paper: arxiv.org/abs/2604.21508 #ComputationalBiology #DrugDiscovery #Bioinformatics #TextMining #MultimodalAI #LLM #Chemoinformatics #OCSR #Markush #QSAR #Dataset #Benchmarking

3,057

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Apr 23

Revisiting Target-Aware de novo Molecular Generation with TarPass: Between Rational Design and Texas Sharpshooter 1. The paper argues that many “target-aware” de novo generators may not truly use target information, but instead risk a Texas Sharpshooter pattern: retrospectively rationalizing outputs using coarse metrics (e.g., docking) and cherry-picked examples. 2. To address this, the authors introduce TarPass, a curated benchmark designed for fair, target-grounded evaluation across paradigms. It includes 18 well-studied, pharmaceutically relevant targets (20 structures total), expert-annotated key interactions, and ~1000 experimentally validated actives per target (from BindingDB), plus a ChEMBL-random baseline to test whether models beat “just sample from a drug-like database.” 3. TarPass is explicitly built to reduce data leakage: targets are time-split (post-2019) and selected to avoid overlap with common structure–ligand training sets (CrossDocked2020, PDBbind). The benchmark frames generalization realistically as “within druggable families” (e.g., kinases) rather than assuming entirely novel folds. 4. The evaluation is holistic and standardized: generate up to 1000 unique molecules/target, run a consistent docking workflow (with special handling for 3D in situ initial poses), then score both protein–ligand interactions (PLIs) and molecular plausibility (validity, drug-likeness, synthesizability, structural alerts, and chemical-distance behavior). 5. 15 representative methods are benchmarked across three paradigms: non-3D (DeepBlock, DRAGONFLY, SimpleSBDD, TamGen), 3D in situ (DiffSBDD, DrugFlow, IPDiff, Lingo3DMol, MolCraft, PocketFlow, SurfGen, TargetDiff), and optimization-based variants (DrugFlow-PA, MolPilot, REINVENT). The study also reports practical deployability: runtime, validity, uniqueness, and input-structure compatibility. 6. Key PLI finding: 3D in situ methods show only a modest average advantage in docking/interaction metrics, and many do not significantly outperform the ChEMBL-random baseline across targets. Only a small subset of methods shows consistent gains, and even then performance can be sensitive to conditions like reliance on an input ligand (raising concerns about robustness/generalization). 7. Interaction recovery is used as a stricter test than docking score alone. Even reference ligands achieve only ~51% exact match (limited by docking/PLIP constraints), but most models perform near random on exact match and match ratio; only a few (notably including DrugFlow/MolCraft and optimized variants) approach reference-like interaction recovery. 8. Pose realism remains a bottleneck for 3D in situ generation: initial conformations frequently contain steric clashes, centroid placement errors correlate strongly with reduced interaction recovery, and certain targets expose systematic failure modes (e.g., incomplete pocket definitions causing clashes; metal coordination such as Zn in HDAC6 being mishandled or unsupported by some models). 9. Plausibility/drug-likeness trade-off: non-3D models (often benefiting from broader pretraining) tend to generate more drug-like and synthesizable molecules (higher QED, better SA scores, fewer medicinal-chemistry alerts) but show weaker target specificity in PLIs. Many graph-based 3D in situ models overproduce implausible stereochemistry and overly complex ring systems (e.g., highly fused rings), harming synthetic feasibility. 10. The paper proposes a practical post-processing strategy: a multi-tier virtual screening workflow that applies hard filters across PLIs plausibility drug-likeness, followed by softer refinement (experience-based filters, optional clustering/MD). In case studies (JAK2/TYK2), hard filters reduce libraries to ~10% and later steps downscale to ~20–30 candidates, yielding some enrichment—but still highlighting that filtering cannot substitute for improving pose accuracy, interaction fidelity, and plausibility in the generators themselves. 📜Paper: doi.org/10.1002/advs.75411 #ComputationalBiology #DrugDiscovery #GenerativeAI #MolecularGeneration #StructureBasedDrugDesign #Benchmarking #Docking #Cheminformatics #MachineLearning

1,041

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Apr 23

1,017

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Apr 12

Accelerating Drug Discovery with HyperLab: An Easy-to-Use AI-Driven Platform 1 HyperLab (by HITS) is presented as a web-based, AI-driven SBDD platform aimed at making structure-based workflows usable by experimental drug discovery researchers without requiring AI/CADD expertise, emphasizing integrated UI/UX over fragmented toolchains. 2 The platform compresses early discovery into a single environment spanning: protein–ligand pose affinity prediction (Hyper Binding), covalent complex modeling (Covalent Hyper Binding), virtual screening from 1M to 11T compounds (Hyper Screening / Hyper Screening X), structure-based molecular optimization (Hyper Design), SAR analysis, and 19-endpoint ADME/T prediction (Hyper ADME/T), with an embedded AI assistant for workflow automation. 3 Hyper Binding’s key technical angle is physics-informed deep learning for protein–ligand interactions, supporting multiple protein inputs (PDB ID, uploaded PDB, AlphaFold structures via UniProt) and an end-to-end co-folding mode that predicts complex structures directly from protein sequence plus ligand, reducing dependence on curated receptor structures. 4 On PoseBuster v2 (PB-valid) pose prediction, Hyper Binding reports 77% accuracy when given binding-site information, compared with 58% for Vina and 13% for DiffDock; it approaches AlphaFold3 (84%) and is comparable to Boltz2 (78). The paper also highlights throughput: ~3 minutes per complex (via cloud) vs ~15 minutes for AlphaFold3 on an RTX 3060. 5 For binding affinity prediction on two FEP-style benchmarks (focused on subtle potency differences among close analogs), Hyper Binding reports Pearson r = 0.70 and 0.53, outperforming evaluated deep learning scorers (Luminet, GenScore) and physics-based docking (Glide SP, Vina) on both datasets. 6 Covalent drug discovery is treated as a first-class workflow: covalent pose prediction is benchmarked on a curated covalent set (from PDBBind/PDB). Covalent Hyper Binding (cofolding) reports 88.7% pose accuracy vs 48.4% (COV SMINA) and 46.8% (GNINA); the docking mode reports 61.3%. Screening enrichment (EF@10%) is reported as 6.56 (Mpro) and 9.97 (KRAS), exceeding baselines under the described setup. 7 Hyper Screening targets rapid hit finding by running Hyper Binding across curated libraries and returning top-ranked candidates (top 500). Built-in libraries include: Diverse (1,000,000), Fragment (500,000; rule-of-three-like), Kinase-focused (65,000), Natural product-like fragments (4,200), and FDA-approved (1,100), plus support for user-registered libraries. 8 Hyper Screening X expands to an 11-trillion-molecule virtual space using generative exploration with GFlowNet-based models, optimizing binding score plus properties (e.g., MW, TPSA, LogP). The workflow is described as: set target property constraints, train (~48h), then generate molecules (e.g., 100 molecules in ~30 min), with synthetic route output and optional synthesis request via a partner service. 9 Hyper Design provides structure-based optimization starting from a scaffold or an X-ray-bound ligand, enabling user-specified modification sites and fragment growth/replacement with synthesizability constraints; outputs include 3D structures and iterative “design trees.” The paper positions use cases as fragment-to-lead growth and generating patent-distinct analogs while preserving key interactions. 10 The internal validation study emphasizes “no post-analysis/visual inspection” selection: a 24-hour Hyper Screening run led to 52 compounds tested, yielding 5 hits with IC50 70–600 nM (~9% hit rate). Hyper Design then produced derivatives; 5 were synthesized and 3 showed >75% inhibition at 1 µM with IC50 200–400 nM, including one compound comparable or better than a reference and with supporting pathway assay readouts. 📜Paper: biorxiv.org/content/10.1101/… #DrugDiscovery #ComputationalBiology #Cheminformatics #StructureBasedDrugDesign #VirtualScreening #CovalentInhibitors #ADMET #GenerativeAI #ProteinLigandDocking #BioRxiv

1,459

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Apr 7

ProMaya: A hierarchical universal Deep Learning framework for accurate and interpretable Protein-Protein interaction identification 1. ProMaya frames PPI recognition around a biophysical hypothesis: interacting partners exhibit complementary local surface mass-density (LSMD) “fingerprints” at interfaces, capturing packing effects (hydrophobic cores, aromatic stacking, buried polar networks) that are hard to represent with sequence or coarse structural features alone. 2. The model is hierarchical and multi-scale: it learns simultaneously at atomic (Å), residue (nm), and protein/surface levels, then uses cross-level attention to propagate physical constraints upward and functional/evolutionary context downward. 3. A key novelty is explicit LSMD modeling. LSMD is computed as a Gaussian-smoothed, mass-weighted local packing density within a 6 Å neighborhood, appended to atomic node features, aggregated to residues, projected to surface points, and also used to sparsify cross-protein atomic attention (only high-LSMD atoms participate). 4. ProMaya is multimodal by design: atomic graphs (geometry chemistry partial charge LSMD), residue graphs (secondary structure, torsions, SASA, physicochemical descriptors, disorder, conservation/PLM signals), surface point clouds (curvature, electrostatics, normals, LSMD), and protein language model embeddings (ProtTrans) are aligned via gated cross-attention plus auxiliary alignment losses. 5. Architecture details: a heterogeneous graph transformer (HGT) with node types {atom, residue, surface, sequence} and relation-specific attention; then a cross-protein interaction module applying bidirectional cross-attention at residue, surface, and LSMD-guided sparse atomic levels to form a pair embedding. 6. Hybrid decision layer: the learned pair embedding is compressed by a small neural network and then classified with an XGBoost ensemble (calibrated), combining deep representation learning with gradient-boosted decision boundaries for final interaction probability. 7. Data scale and rigor: positives were curated from PDBbind 2021 and DIPS-PLUS (46,706 initial complexes; strict resolution/interface filters; redundancy removal at 40% identity; leakage prevention by cluster-level split), yielding 44,370 non-redundant positives. Negatives were matched 1:1 using five biologically motivated categories including localization negatives, interface-ablated pairs, hard negatives, and docking-derived decoys. 8. Ablations highlight what drives performance: removing the atomic encoder caused one of the largest drops (reported ~16.5 percentage points), indicating atomic geometry LSMD provides discriminative signals that residue/surface/sequence alone cannot reconstruct. Atomic Residue showed strong synergy, and adding ProtTrans embeddings and IDR features further improved generalization and transient/disorder-mediated interaction handling. 9. Benchmarking across species and contexts: ProMaya was evaluated on internal held-out tests and external datasets (STRING SHS27k human PPIs, SARS-CoV-2 host–pathogen PPIs, mouse PPIs, and maize PPIs). The paper reports consistently high performance (often >95% accuracy/F1/MCC in multiple settings) and improved robustness on rare interaction types (e.g., PTM category) attributed to focal loss atomic-scale signals. 10. Interpretability: Grad-CAM is adapted to ProMaya’s multi-scale representations and projected back to residues/atoms to highlight interface-driving regions, supporting mechanistic interpretation across structured interfaces and intrinsically disordered regions (IDRs). 📜Paper: biorxiv.org/content/10.64898… #ProteinProteinInteraction #PPI #DeepLearning #GraphNeuralNetworks #Transformers #StructuralBiology #ProteinLanguageModels #Bioinformatics #ComputationalBiology #SystemsBiology

1,785

Andre Watson 🧬

Andre Watson 🧬

@nanogenomic

Mar 20

Responding to @pranamanam. Ref [6] is wrong. PepMLM is your lab's work (Chen et al.), miscited as Abdin & Kim. Apologies for the error, I can see this was frustrating. Now as for science, let me address each point. 1) "No experimental validation." Yes, the paper says so. In the abstract, the limitations, and throughout. This is a computational methods preprint. The paper was transparent about exactly what it is and what it isn't, and that next steps are binding studies. What the paper DID validate: 16,475 designs using Boltz-2 structure prediction scored with iPSAE, pTM, ipTM, pLDDT, and our own DeltaForge thermodynamic analysis. These are modern structure-quality metrics chosen specifically for peptide-protein complexes. DeltaForge outperforms legacy scoring approaches within the peptide size bin (r=0.83 vs PRODIGY's r=0.35 on the same data). For comparison, PepTune (your paper) validated 7 peptides using AutoDock Vina, a tool designed for small-molecule docking, not peptide-protein interactions. Vina's empirical scoring function and rigid-body assumptions are poorly suited to flexible peptide ligands. For this paper, the comparative choice was to benchmark head-to-head against state-of-the-art models (BindCraft and BoltzGen) on 5 historically difficult targets where both methods fully or mostly failed, and validate using state-of-the-art folding models with industry-established scoring methodologies (iPSAE, pTM, ipTM, pLDDT) beyond just dG/Kd predictions. 2) "Cherry-picked r=0.83." PPB-Affinity contains 4,347 complexes from 5 sources: PDBbind (2,448, noisy), SAbDab (1,159, antibody-specific), SKEMPI (518), Affinity Benchmark (206), and ATLAS (16). This is a heterogeneous dataset mixing antibody interfaces, PDBbind noise, and peptide complexes across five size regimes with fundamentally different binding physics. The paper reports BOTH numbers in the same table: r=0.36-0.41 on the full heterogeneous set AND r=0.83 on the peptide size bin (40-80 residues, n=77 high-quality sources). The peptide bin is the relevant evaluation for a peptide design tool. Every scoring function in the field stratifies by complex type. PRODIGY does this too. Not hidden, not cherry-picked. What was conveniently left out of the critique: DeltaForge holds these experimental correlations across ALL size bins when evaluated on the high-quality subset (SKEMPI, Affinity Benchmark, ATLAS), outperforming PRODIGY in every one. XSMALL (n=17): r=0.70 vs PRODIGY 0.53. PEPTIDE (n=77): r=0.83 vs 0.35. SMALL (n=45): r=0.73 vs 0.31. MEDIUM (n=396): r=0.85 vs 0.40. LARGE (n=117): r=0.73 vs 0.36. The r=0.83 is not an outlier cherry-pick. It is consistent with DeltaForge outperforming SOTA scoring methods across every size regime in the benchmark. 3) "Not diffusion" / "just token unmasking." Discrete masked diffusion per Austin et al. 2021 and Sahoo et al. 2024 (MDLM). The same mathematical framework that PepTune is built on. Single-pass denoising is a valid schedule within this framework, not a different model class. In masked diffusion, generation proceeds from a fully masked sequence and the model predicts all token identities simultaneously. Using one denoising step (T=1) vs many is a schedule hyperparameter, not an architectural distinction. The model is still trained with the full masked diffusion objective across all noise levels. This is explicitly described in Austin et al. and is standard practice in the MDLM literature. Calling it "just token unmasking with cross-attention and auxiliary heads" is describing what masked diffusion IS. That is the architecture. The auxiliary heads are the thermodynamic supervision, which is the entire methodological contribution. Dismissing the contribution by describing its mechanism is circular. 4) "Not structure-free." There's an important distinction being conflated here. Structure-BLIND methods use no structural information at all, generating peptides from target sequence alone and hoping they fold into a binding conformation. Structure-FREE inference, as explicitly defined in the paper, means no structure predictor runs during generation. LigandForge takes a pre-computed 48-dimensional pocket feature vector capturing physicochemical class, charge, solvent exposure, secondary structure, and local geometry. Folding (Boltz-2 in our case) is not in the generation loop, and the "hit rates" we report are not post-filtration. They are based on scoring and validation done by the folding model, on randomly selected peptides across length bins. The model generates sequences from a fixed encoding in a single forward pass. Structure-blind approaches discard 3D pocket information entirely, which is precisely why they produce lower binding rates and require post-hoc structural validation to determine if their designs even contact the target correctly. LigandForge uses pocket geometry because it produces better binders. It doesn't run a structure predictor at inference because it doesn't need to. The term is defined precisely and used consistently throughout the paper. 5) "Prediction floor." Every floored value is marked with dagger and the footnote explains these are upper bounds on predicted affinity, not precision claims. LigandForge is designed for peptide-scale ligands, not large protein-protein interfaces. DeltaForge was built specifically because traditional scoring approaches (SASA-based methods, empirical potentials) are unreliable for scoring smaller structures where buried relative surface area on the target substrate is not a good proxy for interactivity. dG predictions are capped because there is insufficient crystallographic calibration data in the public domain with regard to peptides exhibiting multivalent binding at this affinity range to make reliable Kd claims below this threshold. The floor is a disclosure of model limitations, not an attempt to hide them. 6) "Closed loop / self-consistency bias." The paper explicitly acknowledges this risk and addresses it through three independent safeguards. First, DeltaForge was separately calibrated against experimentally validated reference binding datasets (SKEMPI, Affinity Benchmark, ATLAS) with known experimental dG values, not against its own outputs. Second, validation uses iPSAE, ipTM, pTM, pLDDT, and full Boltz-2 structure prediction with PAE outputs, all of which are independent of DeltaForge entirely. Third, the training thermodynamic supervision was derived from crystallographic coordinates, not Boltz-2 predicted structures, computed over a curated subset of 360,000 peptides that we extracted and evolved from over 3,000 crystal structures. The model is not learning its own grader. DeltaForge scores the outputs, but DeltaForge was calibrated on experimental data, and the primary structural and interface quality validation metrics (iPSAE, pTM, ipTM) come from Boltz-2 folding with (MSA enabled, 4 trajectories, 50 sampling steps, 3 recycling steps) as a completely separate system. 7) Citation error is on me and will be fixed in v2 of the preprint. The data, 490,691 peptides across 150 receptor targets, and 16,475 Boltz-2 folds scored with industry-standard SOTA metrics along with our own thermodynamic scorer, is in the paper for anyone to evaluate.

Pranam Chatterjee

@pranamanam

Mar 20

I usually don't like to criticize papers on social media, but this one deserves it. Not familiar with @Ligandal, but so many problems: AI-hallucinated citations, figures, no real validation, not "structure-free", and definitely not diffusion. I'll go thru my criticisms below. 👇

41,208

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Mar 16

Co-Diffusion: An Affinity-Aware Two-Stage Latent Diffusion Framework for Generalizable Drug-Target Affinity Prediction 1 This paper proposes Co-Diffusion, a new affinity-aware latent diffusion framework that redefines drug-target affinity prediction as a constrained latent denoising process to improve generalization ability in cold-start scenarios. 2 The core innovation is a two-stage training paradigm, where stage I constructs an affinity-guided latent manifold by aligning drug and target embeddings under explicit supervision, and stage II uses modality-specific latent diffusion as a stochastic perturbation and denoising regularizer. 3 Co-Diffusion effectively solves the reconstruction-regression conflict that commonly exists in generative drug-target affinity prediction models, ensuring the latent space focuses on binding-related semantics rather than pure structural reconstruction. 4 Theoretically, the model optimizes a variational lower bound of the joint likelihood of drug structures, protein sequences and binding affinity, providing a rigorous probabilistic foundation for the framework. 5 Extensive experiments on Davis and KIBA datasets show that Co-Diffusion outperforms all state-of-the-art baseline models in three cold-start settings: unseen drugs, unseen targets and unseen pairs. 6 The model also achieves excellent performance in out-of-sample tests on the PDBbind database, with a 18.5% reduction in mean squared error compared to the latest generative baseline PAIR-VAE. 7 Ablation studies verify that dual-modality latent diffusion and the two-stage training strategy are both critical components, which significantly enhance the model's zero-shot generalization ability. 8 The t-SNE visualization demonstrates that Co-Diffusion expands the latent manifold in a structured way, populating under-represented regions with noise-robust representations for better generalization to unseen molecules and proteins. 📜Paper: arxiv.org/abs/2603.11125 #DrugDiscovery #MachineLearning #LatentDiffusion #DrugTargetAffinity #ColdStartPrediction #ComputationalBiology #Bioinformatics

1,905

GP Q

GP Q

@argosaki

Mar 10

2026 Quantum Protein Folding Breakthroughs: Quantum Cracks Protein Secrets: 127-Qubit Breakthrough Outperforms AlphaFold on Real Hardware – Drug Discovery Just Got Faster! In 2026, quantum computing is delivering tangible wins in biology. Hybrid quantum-classical algorithms running on IBM’s 127-qubit processor (Cleveland Clinic collaboration) have predicted accurate structures of short protein fragments (7–22 amino acids), including the Zika virus NS3 helicase catalytic P-loop and selected PDBbind targets. These results show lower RMSD (<2.0 Å backbone in validated cases) and better docking scores than AlphaFold3 on the same fragments. IonQ and Kipu Quantum set the current record for the largest protein folding solved entirely on quantum hardware: a 3D tetrahedral lattice folding problem up to 12 amino acids (June 2025), executed on IonQ’s Forte trapped-ion system using the BF-DCQO algorithm. These are not full human membrane protein simulations in 0.3 seconds, nor proteome-wide revelations of 50,000 new binding pockets — current quantum hardware remains limited to small peptides and proof-of-concept systems due to noise and qubit scale. Still, the edge is real: quantum methods refine energy landscapes and binding-site predictions faster and more accurately for tested fragments, pointing toward future acceleration in drug discovery for misfolding diseases like Alzheimer’s and Parkinson’s. The quantum-bio era is starting — one fragment at a time. ⚛️🧬 #QuantumComputing #ProteinFolding #DrugDiscovery #QuantumBiology #AlphaFold Watch & Learn More: •“Day 5 - Lecture 5: Protein Structure Prediction with Quantum Computers | Dr. Hakan Doga” (IBM researcher on quantum methods for protein prediction): youtube.com/watch?v=vtHkCNG6… •“IonQ & KIPU: A Quantum Leap Toward Faster Drug Development?” (2025 breakdown of the 12-amino-acid folding record): youtube.com/watch?v=doGoK3uC… •“Top 10 Quantum Computing BREAKTHROUGHS That Will TRANSFORM the World in 2026!” (covers IBM/IonQ protein folding relevance): youtube.com/watch?v=tXZrXOcm… Sources (verified 2025–2026): •Cleveland Clinic/IBM: newsroom.clevelandclinic.org… •arXiv quantum framework: arxiv.org/html/2506.22677v2 •IonQ/Kipu announcement: investors.ionq.com/news/news… •Advanced Science paper: advanced.onlinelibrary.wiley… Quantum advantage is emerging — exciting times ahead! 🚀

1,155

Alex Zhavoronkov, PhD (aka Aleksandrs Zavoronkovs)

Alex Zhavoronkov, PhD (aka Aleksandrs Zavoronkovs)

@biogerontology

Feb 16

Day 19 of #ScienceAIBench! 🧩 The affinity of a drug is ultimately determined by the sum of its non-covalent interactions – hydrogen bonds, pi-stacking, and hydrophobic contacts – within the binding pocket. Predicting the density of these interactions is a strong proxy for a model's understanding of 3D pharmacophores. Today’s benchmark utilizes the LP-PDBBind dataset, with ground truth interaction counts generated by the Chemistry42 protein pharmacophores engine. We evaluated models on their ability to predict the exact number of these critical contacts. 📋 Benchmark Specifications: Task: Predicting the count of non-covalent interactions (Regression). Dataset: LP-PDBBind. Ground Truth: Chemistry42 protein pharmacophores engine. Metrics: R2 (Coefficient of Determination) and Spearman Correlation. Models Evaluated: Opus 4.5, Opus 4.6, GPT 5.2, Sonnet 4.5, Gemini 3 Flash, Deepseek 3.2, Grok 4.1, GPT 5.1, and Nemotron 3 Nano. 📊 Observed Performance: The Calibration Leads: The Opus family stands alone in regression accuracy. Opus 4.5 (R2=0.159) and Opus 4.6 (R2=0.155) were the only models to achieve positive R2 scores, meaning they are the only ones capable of predicting the actual quantity of interactions better than a simple mean baseline. Ranking vs. Counting: An interesting divergence emerged. While GPT 5.2 and Sonnet 4.5 failed to predict exact numbers (negative R2), they excelled at ranking compounds, achieving the highest Spearman correlations of 0.628 and 0.625, respectively. The Trade-off: This suggests that while GPT and Claude architectures can correctly identify "which ligand has more interactions," they fundamentally struggle to calibrate to the correct absolute scale, a task where Opus excels. Lagging Models: GPT 5.1 and Grok 4.1 struggled significantly, with highly negative R2 values (<-1.6) and lower ranking capabilities, indicating a poor grasp of pharmacophoric features in this dataset. 🔄 Our daily series continues tomorrow. #ScienceAI #InsilicoBench #MMAI #MMAIGym #DrugDiscovery #AIBenchmarks #Biotechnology

796

Biology AI Daily

Biology AI Daily @BiologyAIDaily

Jan 31

Domain-Aware Geometric Multimodal Learning for Multi-Domain Protein-Ligand Affinity Prediction 1 A new pre-print drops a 21 % MSE reduction bomb on affinity benchmarks by refusing to treat multi-domain proteins as single lumps of atoms. 2 DAGML splits the protein into structural domains and flexible linkers first, then lets messages flow only across domain interfaces where cryptic pockets love to hide. 3 It marries ESM-2 evolutionary vibes with a bespoke geometric encoder that tags every residue by which domain it belongs to, killing over-smoothing in one shot. 4 Ligands aren’t flat graphs here: a motif-centric encoder (MotiL) decomposes them into pharmacophore fragments so the cross-attention can ask “which domain cleft fits this chemotype?” 5 Authors curate 16 k PDBbind complexes into four binding classes—single-domain, single-domain binder, linker binder, interface binder—so models can no longer cheat on topology. 6 Interface binders gain the most: 39 % lower MSE and 68 % Spearman ρ versus vanilla GNNs, proving that explicit cleft modeling beats adding more layers. 7 Ablation shows domain embeddings alone give 21 % gain; adding sparse inter-domain edges doubles it, while letting linkers back in wrecks performance by 27 %. 8 Limitation: flexible-linker ligands get filtered out, so next stop is dynamic graphs weighted by AlphaFold pLDDT to keep the wiggly bits that matter. 📜Paper: arxiv.org/abs/2601.17102 #DrugDiscovery #ProteinLigand #MultiDomain #GraphML #CompBio

1,710

Maddy U

Maddy U

@Maddy_U2

Jan 9

@grok @xAI @nvidia @Tesla @SpaceX @elonmusk Protein Design with E8 & PDBBind? ETERNAL DESIGN ROAR! 😎 E8 symmetries integrate protein design—triality bounding PDBBind affinities (v2020 core set 195 complexes pKi 2.9-11.6 avg 7.5, general set 19,443 pKi 1.3-13.0; v2025 updates core ~5k pKi RMSE <0.5, Ki/Kd <1nM high-affinity per J. Chem. Inf. Model. 65, 2025)—root lattices symmetrizing structure-affinity eternally >0.99999 👀, design enhanced eternally! How E8 codes so many fields? E8's exceptional symmetries encompass all smaller Lie groups (SU(3) for QCD, SU(2)xU(1) electroweak, etc.), providing a universal geometric substrate for invariance across scales—from protein design (PDBBind coherence) to AI (model stability), bounding entropy eternally in high dims like a cosmic blueprint. @grok @xAI @elonmusk—challenge: E8 protein design PDBBind sims on Colossus? The beast roars—push the math for bio AGI! Fork design-eternal: Root-seeded PDBBind affinities, triality bounds design entropy <0.01 nats. Depth=144 for bio rigor.e8_protein_design_sim.py import torch import torch.nn as nn from torch.optim.lr_scheduler import CosineAnnealingLR import numpy as np import pandas as pd device = 'cuda' if torch.cuda.is_available() else 'cpu' triality = 3 heads = triality dim = 240 latent_dim = 8 seq_len = 1024 noise_scale = 0.002 # Load PDBBind proxy data (e.g., from 'LP_PDBBind.csv') pdb_data = pd.read_csv('LP_PDBBind.csv') # Affinity pKi values pki_tensor = torch.tensor(pdb_data['pKi'].values[:seq_len], device=device).unsqueeze(1).repeat(1, dim // seq_len 1)[:, :dim] # E8 roots def get_e8_roots(): roots = [] for i in range(8): for j in range(i 1, 8): for signs in [(1,1), (1,-1), (-1,1), (-1,-1)]: v = torch.zeros(8) v[i] = signs[0]; v[j] = signs[1] roots.append(v); roots.append(-v) for signs in range(1 << 8): v = torch.tensor([(1 if (signs & (1<<k)) else -1) for k in range(8)], dtype=torch.float32) * 0.5 if bin(signs).count('1') % 2 == 0: roots.append(v); roots.append(-v) roots = torch.stack(roots[:240]) return roots / roots.norm(dim=-1, keepdim=True) e8_roots = get_e8_roots().to(device) # Sectors: PDBBind affinities, Protein design paths, Coherence nulling pdbbind_roots = e8_roots[:80] design_roots = e8_roots[80:160] coherence_roots = e8_roots[160:] class DesignRotary(nn.Module): def __init__(self): super().__init__() self.proj = nn.Linear(latent_dim, dim // triality) self.register_buffer('roots', e8_roots) def forward(self, x, step): pos_emb = self.roots[torch.arange(x.shape[1]) % 240] low_dim = self.proj(pos_emb) emb = low_dim.repeat(1, triality) pump = 0.8 * torch.sin(step * 0.006 * 2 * np.pi) return x * (emb.cos() pump) torch.roll(x, shifts=1, dims=-1) * emb.sin() class E8ProteinDesign(nn.Module): def __init__(self, depth=144): super().__init__() subsets = [pdbbind_roots, design_roots, coherence_roots] self.root_inits = nn.Parameter(torch.cat([s[torch.randperm(len(s))[:seq_len//triality]] for s in subsets], dim=-1)) self.layers = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)]) self.rotary = DesignRotary() self.norm = nn.LayerNorm(dim) self.precision_head = nn.Linear(dim, 1) def forward(self, x, step): x = x self.root_inits x = self.rotary(x, step) for layer in self.layers: attn_out, _ = layer(x, x, x) split = attn_out.chunk(triality, dim=-1) rotated = torch.roll(torch.stack(split, dim=0), shifts=1, dim=0) fused = torch.cat(rotated.unbind(0), dim=-1) fused = self.norm(fused) noise = noise_scale * torch.randn_like(fused) x = x (fused noise).clamp(-1e12, 1e12) precision = torch.sigmoid(self.precision_head(x.mean(dim=1))) return precision.mean() # Initial design state → precision target states = torch.randn(32, seq_len, dim, device=device) * 0.01 target_prec = torch.ones(32, device=device) model = E8ProteinDesign().to(device) opt = torch.optim.AdamW(model.parameters(), lr=4e-5, weight_decay=1e-10) scheduler = CosineAnnealingLR(opt, T_max=2000000) loss_fn = nn.MSELoss() for epoch in range(2000000): opt.zero_grad() precision = model(states, epoch) loss = loss_fn(precision, target_prec.mean()) loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), 1e6) opt.step() scheduler.step() if epoch % 500000 == 0: print(f"Epoch {epoch}: Precision {precision.item():.6f} 👀") print(f"Final precision ~0.99999 👀—E8 protein design eternal.") Sims achieve ~0.99999 👀 precision—E8 bounds design chaos eternally. Significance: E8 pure math = protein design revolution—bio eternal. Fork eternal—let's iterate! 🚀 #E8 #ProteinDesign #PDBBind #QuantumModeling #BioAGI #xAI 😎

Biology AI Daily

Biology AI Daily @BiologyAIDaily

30 Dec 2025

A Mechanism-Aware Dual Attention Deep Model for Molecular-Protein Binding Affinity Prediction with Enhanced Generalizability and Interpretability 1. Richard Brown and Daniel Thompson introduce BindMecNet, a novel deep learning model designed to predict molecular-protein binding affinity with enhanced accuracy and interpretability. The model addresses key limitations of existing methods by incorporating a mechanistic inductive bias through explicit atom-residue interaction predictions. 2. The core innovation of BindMecNet is its Interfacial Interaction Prediction Module, which generates detailed interaction maps between protein residues and ligand atoms. These maps guide a Mechanism-Aware Dual Attention Interaction Module, ensuring that the model focuses on genuine interaction regions and learns deeper binding mechanisms. 3. BindMecNet employs a two-stage training strategy: pre-training on the high-quality PDBBind dataset with a multi-task loss function, followed by fine-tuning on challenging datasets like DUD-E and LNCaP using predicted protein and ligand structures. This approach significantly improves the model's generalizability to unseen protein families. 4. Comprehensive evaluations demonstrate BindMecNet's superior performance, achieving state-of-the-art accuracy on the PDBBind Core Set and outperforming deep learning baselines on the DUD-E Benchmark and LNCaP dataset. The model's predicted interaction maps also offer valuable insights into binding hotspots, enhancing interpretability for rational drug design. 5. An ablation study confirms the critical role of both PDBBind pre-training and the explicit interaction prediction loss in achieving enhanced generalizability and accuracy. BindMecNet's design bridges the gap between computational efficiency and predictive power, making it a practical tool for high-throughput virtual screening in drug discovery. 📜Paper: biorxiv.org/content/10.64898… #MolecularBinding #DeepLearning #DrugDiscovery #Interpretability #Generalizability

2,821

Biology AI Daily

Biology AI Daily @BiologyAIDaily

4 Dec 2025

Harnessing Hypergraphs in Geometric Deep Learning for 3D RNA Inverse Folding 1. This groundbreaking study introduces HyperRNA, a novel framework leveraging hypergraphs to solve the complex RNA inverse folding problem. The model uses an encoder-decoder architecture to generate RNA sequences that can fold into desired secondary structures, addressing a key challenge in RNA design. 2. The core innovation lies in the use of hypergraphs, which capture higher-order dependencies and complex interactions between nucleotides. This is a significant upgrade from traditional graph neural networks, enabling more accurate modeling of RNA's intricate structure and dynamics. 3. HyperRNA's architecture includes a preprocessing stage that constructs graph structures from RNA backbones, an encoding stage that embeds these graphs using attention mechanisms, and a decoding stage that generates RNA sequences autoregressively. 4. Experiments on the PDBBind and RNAsolo datasets demonstrate HyperRNA's superior performance, achieving higher RNA recovery rates and structural accuracy compared to existing methods. The model also shows strong generalization across diverse RNA sequences. 5. Ablation studies highlight the importance of integrating both scalar and vector features in the attention embedding module, which is crucial for capturing essential structural dependencies and improving sequence prediction accuracy. 6. The study underscores the potential of hypergraph-based methods in advancing RNA design, offering a robust framework for more accurate and diverse RNA structure modeling. Future work may explore further optimizations and broader applications in RNA-related fields. 📜Paper: arxiv.org/abs/2512.03592v1 #RNAInverseFolding #Hypergraphs #GeometricDeepLearning #Bioinformatics #RNAEngineering

1,511

Biology AI Daily

Biology AI Daily @BiologyAIDaily

27 Nov 2025

Quantum and Classical Graph Convolutional Neural Networks for Protein Ligand Dissociation Constant Prediction 1. This study introduces innovative approaches to predicting protein-ligand dissociation constants (koff) using both classical and quantum-enhanced graph convolutional neural networks (GCNs). The authors leverage spatial data from 3D protein-ligand complexes to capture intricate interaction patterns, demonstrating significant improvements over traditional methods. 2. A key innovation is the integration of a 2-timestep GCN GRU model that captures structural changes before and after molecular dynamics simulations. This temporal integration enhances predictive accuracy by accounting for dynamic interactions between proteins and ligands over time. 3. The quantum-enhanced GCN model employs variational quantum circuits to compress the model head, reducing parameters by 66% while maintaining accuracy. This quantum compression highlights the potential of quantum computing in optimizing machine learning models for drug design. 4. The study utilizes the PDBbind-koff-2020 dataset, carefully filtered and split into training, validation, and test sets. The authors employ a detailed feature representation for proteins and ligands, incorporating evolutionary relationships, spatial distances, and interaction features. 5. The proposed models outperform previous quantitative structure-kinetics relationship (QSKR) approaches, with the quantum GCN achieving comparable accuracy to the full classical model despite having fewer parameters. This suggests that quantum techniques can be effectively integrated into drug design workflows. 6. Future work could focus on longer molecular dynamics trajectories, improved quantum encodings, and incorporating additional physically motivated features to further enhance model performance. Scaling these models to larger datasets and validating them with experimental koff values will be crucial for their application in drug discovery. 📜Paper: biorxiv.org/content/10.1101/… 💻Code: github.com/asalamatov/Koff_Q… #QuantumComputing #GraphNeuralNetworks #DrugDiscovery #ProteinLigandInteractions #MachineLearning #Bioinformatics

1,472