ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning
1. CONTACT reframes antigen-conditioned antibody CDR design as two distinct problems that should not be conflated: (a) deciding which CDR positions will actually contact the antigen (the “where”), and (b) choosing amino acids at those positions (the “what”). The paper argues that current models often underuse antigen information because they solve both problems implicitly, with uniform message passing and a uniform sequence loss.
2. The core architectural idea is a contact-then-act, three-stage cascade for CDR-H3: Stage 1 learns per-position “surface complementarity fingerprints”; Stage 2 explicitly predicts which CDR residues contact the antigen (supervised); Stage 3 injects antigen features into the sequence head only where contacts are predicted, so antigen signal is routed preferentially to binding-critical positions.
3. Stage 1 (Complementarity fingerprinting) produces a compact vector per CDR position summarizing the local binding environment, inspired by molecular surface fingerprints. It is trained with an InfoNCE-style contrastive objective so positions facing similar antigen environments have similar fingerprints, improving downstream contact prediction.
4. Stage 2 (Contact prediction) uses a supervised contact label defined by a Cα–Cα threshold of 8 Å. The predictor combines CDR embeddings, KNN-aggregated antigen features, minimum-distance encodings, and the Stage 1 fingerprint. A focal binary cross-entropy loss addresses contact/non-contact imbalance and focuses learning on ambiguous boundary cases.
5. Stage 3 (Contact-guided injection) performs “double gating” to control antigen-conditioning strength at each CDR position: a learned gate multiplied by the predicted contact probability. This aims to prevent distant/noisy antigen residues from influencing non-contact positions, while still allowing fine-grained modulation at predicted contact sites.
6. The model also adds a distance-biased cross-attention module: standard cross-attention scores are augmented with a Gaussian bias based on predicted Cα distances, encoding a geometric prior that spatial neighbors should matter more for binding than far-away residues.
7. On the encoder side, CONTACT uses a heterogeneous VirtualNode-EGNN whose virtual nodes connect to all epitope residues and all CDR residues. This creates a two-hop shortcut for epitope-to-CDR information flow and mitigates the over-squashing that can occur when signals must traverse long chains of message-passing steps.
8. Training uses a multi-term objective centered on sequence loss plus explicit contact loss and fingerprint loss, along with coordinate, pairing (CDR–antigen matching), docking (encouraging proximity to the epitope), and auxiliary regularization terms. A key detail is contact-weighted cross-entropy for sequence prediction: positions with higher predicted contact probability receive larger weights, concentrating gradient on binding-relevant residues.
9. Results on CHIMERA-BENCH (2,922 complexes; epitope-group split) show CONTACT leading on structural and interface-awareness metrics among 11 retrained baselines: RMSD 1.63 Å (7% better than next-best), epitope F1 0.79 (10% over GNN baselines), fnat 0.67, and DockQ 0.73, with competitive sequence recovery (AAR 0.38). The paper highlights that CAAR, the recovery measured at contact positions, remains low (0.20) across methods. This points to a remaining bottleneck: Cα-level antigen representations may not capture enough chemistry (side chains, electrostatics) to determine residue identity at contacts.
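Sketch for point 3: a minimal InfoNCE-style loss over fingerprints, to make the contrastive idea concrete. The pairing rule (row i of `anchors` matches row i of `positives`), the cosine similarity, and the `temperature` value are assumptions for illustration; the thread does not say how the paper builds positive pairs.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i] (assumed pairing)."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate, sharpened by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true match among all candidates.
        total += log_denom - logits[i]
    return total / len(a)
```

The loss is low when each fingerprint is closest to its own partner and high when partners are mismatched, which is the behavior a contrastive objective of this kind rewards.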
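Sketch for point 4: one way the 8 Å Cα–Cα contact label and a focal binary cross-entropy could be written. Labeling a CDR residue as a contact when its nearest antigen Cα is under the cutoff, the strict `<` comparison, and the `gamma` and `alpha` defaults are all assumptions; only the 8 Å threshold and the use of a focal loss come from the summary.

```python
import math

def contact_labels(cdr_ca, antigen_ca, cutoff=8.0):
    """1 if a CDR residue's nearest antigen C-alpha is within `cutoff` angstroms (assumed rule)."""
    labels = []
    for c in cdr_ca:
        d_min = min(math.dist(c, a) for a in antigen_ca)
        labels.append(1 if d_min < cutoff else 0)
    return labels

def focal_bce(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal binary cross-entropy; gamma and alpha values are illustrative defaults."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)              # keep log() finite
        p_t = p if y == 1 else 1.0 - p               # probability given to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balance weight
        # (1 - p_t)^gamma shrinks the loss on easy, confidently correct positions.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(labels)
```

With `gamma=0` and `alpha=0.5` this reduces to half the ordinary binary cross-entropy, which is a quick sanity check on the implementation.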
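Sketch for point 5: the “double gating” idea reduced to its arithmetic. Here `gate_logit` stands in for the output of whatever learned gate network the paper uses, and the residual addition of the antigen context is an assumption; the summary only states that a learned gate is multiplied by the predicted contact probability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def double_gated_injection(h_cdr, ag_context, contact_prob, gate_logit):
    """Add antigen context to each CDR position, scaled by gate * contact probability.

    h_cdr, ag_context: per-position feature vectors (lists of equal length).
    contact_prob: predicted contact probability per position (from Stage 2).
    gate_logit: hypothetical per-position output of a learned gate, before the sigmoid.
    """
    out = []
    for h, a, p, g in zip(h_cdr, ag_context, contact_prob, gate_logit):
        strength = sigmoid(g) * p  # the "double gate": learned gate times contact probability
        out.append([hi + strength * ai for hi, ai in zip(h, a)])
    return out
```

A position with contact probability 0 passes through unchanged regardless of the learned gate, which is how the antigen signal is kept away from non-contact positions in this sketch.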
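Sketch for point 6: single-query cross-attention with a distance-based bias. The bias form used here, subtracting d²/(2σ²) from each score so that the attention weights are multiplied by a Gaussian kernel of the distance, is one plausible reading of “Gaussian bias” and is an assumption, as are `sigma` and the scaled dot-product scoring.

```python
import math

def distance_biased_attention(q, keys, values, dists, sigma=4.0):
    """Attend from one CDR query to antigen residues, favoring spatially close ones.

    dists: (predicted) C-alpha distance from the query residue to each antigen residue.
    """
    scale = math.sqrt(len(q))
    scores = [
        # Dot-product score plus a log-Gaussian distance bias (assumed form).
        sum(qi * ki for qi, ki in zip(q, k)) / scale - d * d / (2.0 * sigma ** 2)
        for k, d in zip(keys, dists)
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights
```

When the content scores are equal, the nearer antigen residue receives the larger weight, which is the geometric prior the summary describes.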
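Sketch for point 8: a contact-weighted cross-entropy for the sequence head. The weight `1 + lam * contact_prob` and the normalization by the sum of weights are assumptions chosen for illustration; the summary states only that positions with higher predicted contact probability receive larger weights.

```python
import math

def contact_weighted_ce(logits, targets, contact_prob, lam=1.0):
    """Cross-entropy over CDR positions, up-weighting likely contact positions.

    logits: per-position scores over the 20 amino acids.
    targets: index of the true amino acid at each position.
    contact_prob: predicted contact probability per position.
    lam: hypothetical knob for how strongly contacts are emphasized.
    """
    num = 0.0
    den = 0.0
    for row, t, p in zip(logits, targets, contact_prob):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[t]   # negative log-likelihood of the true residue
        w = 1.0 + lam * p      # assumed weighting scheme
        num += w * nll
        den += w
    return num / den
```

Setting `lam=0` recovers the ordinary mean cross-entropy, so the same function can be used to compare uniform and contact-weighted training.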
📜Paper:
arxiv.org/abs/2605.21600
#ComputationalBiology #AntibodyDesign #ProteinDesign #GeometricDeepLearning #GNN #EquivariantNetworks #StructuralBiology #MachineLearning