MoE-Bind: Guiding De Novo Protein Binder Generation with Sparse Experts
1. The paper introduces MoE-Bind, a sequence-only autoregressive protein binder generator that combines Multi-head Latent Attention (MLA) with a sparse Mixture-of-Experts (MoE) feed-forward stack, aiming to keep binder generation fast and structure-free at inference while improving quality per unit compute.
2. Key architectural idea: sparsify where most parameters live. Since transformer FFNs hold a large fraction of parameters, MoE-Bind replaces dense FFNs with top-2 routing over 8 SwiGLU experts (plus a shared always-on expert), so each token activates only 2 of the 8 routed experts (~25% of routed-expert parameters) plus the shared expert, while total capacity increases.
3. MLA targets the other bottleneck: KV-cache memory during autoregressive decoding with long receptor prompts. MoE-Bind compresses keys/values into a low-rank latent (rank r_KV = 64) and uses decoupled RoPE (positional information carried in a separate subspace), yielding a large KV-cache reduction (reported 24× vs a GPT2-like MHA peer at the 100M tier).
4. Compute/parameter framing: the 100M-parameter MoE-Bind model has ~102.7M total params but ~38.8M active params per token, positioning it as “compute-matched” against ~38M dense baselines while often matching or exceeding ~100M dense baselines in structure-level metrics.
5. Training pipeline: pre-train on UniRef50 (character-level tokenization; 31-token vocab including delimiters/control tokens) with next-token prediction, then instruction fine-tune on high-confidence STRING v12 physical PPIs (score ≥900) after heavy redundancy reduction (MMseqs2 clustering at 40% identity, 80% coverage), ending with ~2.1M usable interaction pairs.
6. Leakage control is a major methodological emphasis. For DB5 evaluation, the authors build a strict 22-target benchmark by removing any DB5 proteins with ≥10% identity (≥80% coverage) to UniRef50 or STRING sequences, then also report a larger benchmark (78 unique targets) under a relaxed fine-tuning-only leakage filter and additional deduplication.
7. Structure-level evaluation uses structure predictors only for external assessment, not for inference-time filtering: AlphaFold2-Multimer (ColabFold) on the strict 22-target DB5 set, and Boltz-2 with MSA on the larger 78-target set. Hits are defined stringently as generated ipTM ≥ reference (native pair) ipTM for the same target.
8. Main structure-level results: on the 22-target AF2-Multimer evaluation, MoE-Bind achieves 6/22 hits (27.3%) vs 3/22 (13.6%) for MHA and 4/22 (18.2%) for GQA. On the 78-target Boltz-2 MSA benchmark, MoE-Bind reaches 19/78 hits (24.36%) while activating only ~38.8M params/token, slightly higher than the dense 100M baselines (GQA-100M 23.08%, MHA-100M 21.79%) and higher than the compute-matched dense ~38M baselines (GQA-38M 20.51%, MHA-38M 16.67%).
9. Sequence-level quality: MoE-Bind’s generated binders better match DB5 amino-acid composition, avoid long homopolymer runs (no runs of 6–7 or ≥8 identical residues reported), show “controlled novelty” vs STRING (less mass at ~0% identity than dense baselines), and have improved predicted stability by instability index (median ~29–30, with ~2/3 below 40, the conventional cutoff for a predicted-stable protein).
10. Interpretability contribution: routing analysis reports expert specialization at individual amino-acid and biochemical-group levels, arguing that proteins’ small, biochemically structured alphabet makes MoE routing more interpretable than typical natural-language MoE behavior, and suggesting future expert pruning/specialization guided by biochemical priors.
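To make the routing in point 2 concrete, here is a minimal pure-Python sketch of top-2 routing over 8 experts with an always-on shared expert. It is an illustration under stated assumptions, not the paper's implementation: the toy experts are simple scalings rather than SwiGLU blocks, the router logits are made up, and renormalizing the gate weights with a softmax over only the selected experts is one common convention that the paper may or may not use. All names (`top_k_route`, `moe_layer`, `experts`, `shared_expert`) are hypothetical.

```python
import math

NUM_EXPERTS = 8  # routed experts, as in the summary
TOP_K = 2        # experts activated per token


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their gates.

    Assumption: gates are renormalized over the selected experts only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


def moe_layer(x, router_logits, experts, shared_expert):
    """Shared-expert output plus the gate-weighted sum of the routed experts."""
    routed = top_k_route(router_logits)
    out = list(shared_expert(x))  # the shared expert runs for every token
    for idx, gate in routed:
        y = experts[idx](x)       # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, routed


# Toy stand-ins for SwiGLU experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
shared_expert = lambda x: [0.5 * v for v in x]

# Made-up router logits for a single token.
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
out, routed = moe_layer([1.0, -2.0], logits, experts, shared_expert)

print("selected experts and gates:", routed)
print("fraction of routed experts active:", TOP_K / NUM_EXPERTS)
```

Because only the selected experts are evaluated, per-token compute scales with the 2 active experts plus the shared one, while the parameter count scales with all 8. That gap is what separates the ~38.8M active from the ~102.7M total parameters cited in point 4.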
📜Paper:
biorxiv.org/content/10.64898…
#ComputationalBiology #ProteinDesign #ProteinEngineering #ProteinLanguageModels #MixtureOfExperts #Transformers #DeepLearning #Bioinformatics #PPI #DeNovoDesign