AF Cache: Efficient Pipeline for Running AlphaFold for High-Throughput Protein-Protein Interaction Prediction
1 AF Cache is a Nextflow pipeline designed to make large-scale protein–protein interaction (PPI) screening practical with AlphaFold2 and AlphaFold3 by cutting the two dominant runtime bottlenecks: repeated MSA generation and (for AF2) repeated JAX compilation.
2 The key idea is to generate MSAs once per unique monomer for the whole dataset, then reuse (“cache”) them across all pairs. For an all-against-all screen of N proteins, default multimer workflows can trigger ~N(N+1) chain-level MSA generations (two chains for each of the N(N+1)/2 pairs, homodimers included), while AF Cache reduces this to N by construction.
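A quick way to see the counting argument is to enumerate the pairs. The sketch below is only illustrative: `count_msa_jobs` and the `P000`-style identifiers are hypothetical names, not part of AF Cache, and it assumes an all-against-all screen that includes homodimers.

```python
from itertools import combinations_with_replacement


def count_msa_jobs(protein_ids):
    """Compare chain-level MSA jobs with and without per-monomer caching.

    Hypothetical helper for illustration only; not AF Cache's API.
    """
    unique_ids = sorted(set(protein_ids))
    # All-against-all including homodimers: N(N+1)/2 unordered pairs.
    pairs = list(combinations_with_replacement(unique_ids, 2))
    # Without caching, each pair builds an MSA for each of its two chains.
    naive_jobs = 2 * len(pairs)
    # With caching, each unique monomer is aligned exactly once.
    cached_jobs = len(unique_ids)
    return len(pairs), naive_jobs, cached_jobs


ids = [f"P{i:03d}" for i in range(100)]  # placeholder identifiers
n_pairs, naive_jobs, cached_jobs = count_msa_jobs(ids)
print(n_pairs, naive_jobs, cached_jobs)  # 5050 10100 100
```

For N = 100 this gives 5,050 pairs and 10,100 chain-level jobs without caching, against 100 with it.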
3 AF Cache accelerates MSA generation by using GPU-accelerated MMseqs2 and batching all proteins in a single dataset-level step (rather than per target or in small batches). It also overlaps CPU and GPU steps across UniRef and environmental database alignments to reduce GPU idle time.
4 For AlphaFold2 specifically, AF Cache adds sequence-length bucketing for multimers (bucketed by total chain length) and pads within each bucket so JAX compilation happens once per bucket instead of once per pair, saving ~1–2 minutes of GPU time for every additional multimer in that bucket.
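The bucketing idea can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not AF Cache's actual implementation: the bucket width of 256 residues, the function names, and the toy lengths are all hypothetical. Each pair is assigned to the smallest bucket that fits its total chain length, and every pair in a bucket would be padded to that bucket's length so the compiled model can be reused.

```python
from collections import defaultdict

BUCKET_SIZE = 256  # hypothetical bucket width in residues


def bucket_length(total_len, bucket_size=BUCKET_SIZE):
    """Round a total chain length up to the next multiple of bucket_size."""
    return -(-total_len // bucket_size) * bucket_size  # ceiling division


def group_pairs(pairs, lengths, bucket_size=BUCKET_SIZE):
    """Group pairs by padded length; one compilation per key is the goal."""
    buckets = defaultdict(list)
    for a, b in pairs:
        total = lengths[a] + lengths[b]
        buckets[bucket_length(total, bucket_size)].append((a, b))
    return dict(buckets)


# Toy lengths in residues (made up for illustration).
lengths = {"A": 40, "B": 200, "C": 1000}
pairs = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "C"), ("C", "C")]
buckets = group_pairs(pairs, lengths)

for padded_len in sorted(buckets):
    print(padded_len, buckets[padded_len])
```

In this toy case the five pairs fall into four buckets, so one compilation is avoided. The trade-off is that wider buckets mean fewer compilations but more padding per prediction.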
5 Benchmark: 100 human mitochondrial proteins (40–1000 aa) were screened all-against-all including homodimers (5,050 pairs). Settings were streamlined for throughput (AF2: single model, 3 recycles, templates disabled; AF3: single seed, one diffusion sample).
6 Pre-prediction speedups (MSA stage): when comparing to a realistic 128-CPU-core baseline, AF Cache achieved ~13x faster MSA generation for AF2 settings using full BFD, and ~5x faster for AF3 settings using small BFD. The paper also reports very large raw GPU-vs-CPU core-hour reductions, emphasizing the benefit of GPU-based MSA search in high-throughput regimes.
7 End-to-end impact for AF2: caching + bucketing reduced AF2 prediction + compilation time from 253 to 125 GPU hours (>50% reduction), and made inference ~90 seconds faster per protein pair on average while preserving the same runtime scaling with pair length.
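The two headline numbers are consistent with each other, as a short back-of-the-envelope check shows. It assumes the 253 and 125 GPU-hour totals both cover the full 5,050-pair benchmark; the variable names are illustrative.

```python
# Figures quoted in the thread.
default_gpu_hours = 253
af_cache_gpu_hours = 125
n_pairs = 100 * 101 // 2  # all-against-all with homodimers: 5050 pairs

saved_hours = default_gpu_hours - af_cache_gpu_hours  # 128 GPU hours
fraction_saved = saved_hours / default_gpu_hours      # about 0.506
seconds_per_pair = saved_hours * 3600 / n_pairs       # about 91 s

print(f"saved: {saved_hours} GPU h ({fraction_saved:.1%})")
print(f"per pair: {seconds_per_pair:.0f} s")
```

128 GPU hours spread over 5,050 pairs comes to roughly 91 seconds per pair, which matches the reported ~90-second average saving.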
8 Output similarity was assessed via ipTM comparisons. Correlations between default vs AF Cache runs were moderate across all pairs (Pearson r ~0.70 for AF2; ~0.64 for AF3), but much higher for pairs where both proteins map to a shared PDB entry (r ~0.98 for AF2; ~0.94 for AF3), suggesting structurally supported cases are highly consistent across pipelines.
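For readers who want to repeat this kind of comparison on their own runs, here is a minimal sketch of a Pearson correlation over all pairs and over a PDB-supported subset. The records below are invented placeholders, not values from the paper, and the field names (`iptm_default`, `iptm_cache`, `shared_pdb`) are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder records (NOT data from the paper): one entry per protein pair.
records = [
    {"iptm_default": 0.82, "iptm_cache": 0.80, "shared_pdb": True},
    {"iptm_default": 0.15, "iptm_cache": 0.30, "shared_pdb": False},
    {"iptm_default": 0.71, "iptm_cache": 0.69, "shared_pdb": True},
    {"iptm_default": 0.40, "iptm_cache": 0.22, "shared_pdb": False},
    {"iptm_default": 0.55, "iptm_cache": 0.58, "shared_pdb": True},
]

supported = [r for r in records if r["shared_pdb"]]
r_all = pearson_r([r["iptm_default"] for r in records],
                  [r["iptm_cache"] for r in records])
r_pdb = pearson_r([r["iptm_default"] for r in supported],
                  [r["iptm_cache"] for r in supported])
print(f"all pairs: r = {r_all:.2f}; shared-PDB pairs: r = {r_pdb:.2f}")
```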
9 Practical deployment details: AF Cache provides a ready-to-use workflow for local and HPC environments, supports all-against-all or user-specified pair lists, automates AF3 JSON preparation, and can download/install dependencies (including databases; AF2 network parameters) as needed.
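To give a sense of what the automated AF3 JSON preparation has to produce, here is a minimal sketch that builds one input document for a protein pair. The top-level fields follow the AlphaFold3 input format as I understand it and should be checked against the AlphaFold3 documentation. The helper name, job name, and sequences are placeholders, and how AF Cache attaches its cached MSAs is not shown.

```python
import json


def af3_input_json(name, chains, seeds=(1,)):
    """Build a minimal AlphaFold3-style input document for one complex.

    `chains` is a list of (chain_id, sequence) tuples. Hypothetical helper;
    not AF Cache's own code.
    """
    return {
        "name": name,
        "modelSeeds": list(seeds),  # single seed, as in the benchmark settings
        "sequences": [
            {"protein": {"id": chain_id, "sequence": seq}}
            for chain_id, seq in chains
        ],
        "dialect": "alphafold3",
        "version": 1,
    }


# Placeholder sequences, made up for illustration.
doc = af3_input_json(
    "protA_protB",
    [("A", "MKTAYIAKQR"), ("B", "MSDNELLQKA")],
)
print(json.dumps(doc, indent=2))
```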
💻Code:
github.com/clami66/AF_cache
📜Paper:
arxiv.org/abs/2606.04566
#AlphaFold #ProteinInteractions #PPI #Bioinformatics #ComputationalBiology #StructuralBioinformatics #Nextflow #HPC #MMseqs2 #Proteomics