Designing small molecules to hit protein targets is the holy grail of drug discovery, but it’s a beast of a problem. For years, I’ve been obsessed with finding a practical way to crack it. Coming from a physics background, my gut always leaned toward modeling: think docking or molecular dynamics (MD) with explicit or implicit water. Sounds cool, right? Problem is, these methods often fall flat.
Why? First, there’s the energy scale mess: two unit charges 1 Å apart in vacuum interact with a whopping 15 eV (about 170,000 K), while a good ligand’s binding free energy is around 50 kJ/mol (about 6,000 K). That’s a big gap. Water’s dielectric constant (roughly 80x attenuation) tries to bridge it, but that’s a bulk property, useless in a tiny binding pocket. Proteins are floppy, teetering just a few degrees from denaturation, constantly shape-shifting in ways Nature exploits to control molecular activity, so forget assuming they have a small dielectric constant. But, like with water, this doesn’t help much in a small pocket either.
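For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch. It assumes two point unit charges and a uniform relative permittivity (`eps_r`), which is exactly the bulk-style simplification that breaks down inside a pocket; the function names are mine, purely for illustration.

```python
import math

# CODATA constants (SI units)
E_CHARGE = 1.602176634e-19   # elementary charge, C
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
N_A = 6.02214076e23          # Avogadro constant, 1/mol


def coulomb_energy_joules(r_m, eps_r=1.0):
    """Coulomb energy of two unit charges r_m metres apart.

    eps_r is a uniform relative permittivity (an idealisation).
    """
    return E_CHARGE**2 / (4 * math.pi * EPS0 * eps_r * r_m)


def joules_to_ev(energy_j):
    return energy_j / E_CHARGE


def joules_to_kelvin(energy_j):
    return energy_j / K_B


def kj_per_mol_to_kelvin(energy_kj_mol):
    # Energy per molecule divided by k_B (same as dividing by R per mole)
    return energy_kj_mol * 1e3 / (N_A * K_B)


if __name__ == "__main__":
    e_vac = coulomb_energy_joules(1e-10)               # 1 angstrom, vacuum
    e_wat = coulomb_energy_joules(1e-10, eps_r=80.0)   # assumed bulk-water value
    print(f"vacuum:  {joules_to_ev(e_vac):.1f} eV, {joules_to_kelvin(e_vac):.0f} K")
    print(f"eps=80:  {joules_to_ev(e_wat):.2f} eV")
    print(f"binding: {kj_per_mol_to_kelvin(50.0):.0f} K")
```

Running it gives about 14.4 eV (roughly 167,000 K) for the bare interaction and roughly 6,000 K for a 50 kJ/mol binding free energy, so the raw gap is a factor of about 30 before any screening.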
More headaches? Atomic charges aren’t fixed. Polarization effects aren’t small, and getting them right requires full quantum calculations. Even seemingly “tiny” per-atom errors snowball fast in a decent-sized ligand, making the problem a computational nightmare. Sure, you can dock and maybe find something, but often it’s easier to just screen compounds and call it a day.
But what if we’re asking the wrong question? Nature faced this same puzzle—evolution had to optimize molecular machinery to churn out biologically active molecules. All the “useful” chemistry might already be encoded in our genome. What if predicting arbitrary ligand-protein interactions is a fool’s errand? Instead, Nature might have cherry-picked proteins that reliably bind certain molecule classes, with smooth “medchem” tweaks to fine-tune interactions over time.
If that’s true, the rules for crafting bioactive molecules are written in our genome, a kind of chemical language. Enter modern generative AI: what if we could “listen” to the stories in biological sequence data (genome, proteome) and learn to speak chemistry, spitting out molecules that play nice with proteins? If natural selection has already solved the physics in some practical way, maybe finding biologically active molecules is not a physics problem but a language problem.
That’s the spark behind our new model, ProtoBind-Diff, a structure-free masked diffusion model that generates molecules conditioned directly on protein sequences via pre-trained language model embeddings. Trained on over a million protein-ligand pairs from BindingDB, it pumps out chemically valid, novel, and target-specific ligands without ever needing 3D structural data. In head-to-head tests with structure-based models, ProtoBind-Diff holds its own in docking and Boltz-1 (and, spoiler, Boltz-2—data coming soon) benchmarks. It even shines on tough targets with sparse training data.
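If “masked diffusion” sounds abstract, here is a toy sketch of the general sampling idea: start from an all-masked token sequence and fill it in over a few steps. To be clear, this is not ProtoBind-Diff’s code. The vocabulary, the `toy_denoiser` stand-in (which returns random tokens and ignores the protein embedding), and the confidence-ordered unmasking schedule are all illustrative assumptions, and the output is not guaranteed to be a valid molecule.

```python
import math
import random

VOCAB = ["C", "N", "O", "c", "1", "(", ")", "="]  # toy SMILES-like tokens
MASK = "[MASK]"


def toy_denoiser(tokens, protein_embedding, rng):
    """Hypothetical stand-in for a trained network.

    A real model would predict tokens from the partially masked sequence
    AND the protein embedding; this one ignores both and guesses randomly.
    Returns {position: (token, confidence)} for every masked position.
    """
    return {
        i: (rng.choice(VOCAB), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def sample(length, protein_embedding, steps=4, seed=0):
    """Iteratively unmask a sequence, most confident positions first."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_denoiser(tokens, protein_embedding, rng)
        if not preds:
            break
        # Reveal an even share of what is left; the final step reveals all.
        n_unmask = math.ceil(len(preds) / (steps - step))
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_unmask]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    fake_embedding = [0.0] * 8  # placeholder for a protein language model embedding
    print("".join(sample(12, fake_embedding)))
```

The only part that carries over to a real system is the loop shape: the network sees a partially masked molecule plus the protein conditioning, and each pass commits a few more tokens.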
Here’s the kicker: despite never seeing 3D info during training, its attention maps line up with predicted binding residues. It’s like the model learns spatial interaction rules just from sequence data. This could be a game-changer for ligand discovery across the proteome—especially for orphan, flexible, or new targets where structural data is shaky or nonexistent.
Check the link in the first comment for details (public demo dropping soon). As always, give a follow, like, and repost to keep our spirits high—nothing boosts my ego quite like your attention!