AlphaFind v2: Similarity search in AlphaFold DB and TED domains across structural contexts
1 AlphaFind v2 is a web server for fast, structure-based similarity search at AlphaFold DB scale, combining embedding-based prefiltering with alignment-based refinement to keep results biologically interpretable (TM-score/RMSD) while staying interactive.
2 The key design idea is “search across structural contexts”: users can search full chains, restrict comparisons to high-confidence regions using AlphaFold pLDDT thresholds (70/80/90), search TED domains, or run a TED Multidomain mode that captures domain combinations rather than single-domain matches.
3 The workflow is staged for responsiveness: Phase 2 returns immediate approximate kNN results from a vector database (top k=100 by cosine similarity), while Phase 3 runs asynchronously in the background to refine rankings using US-align and report TM-score, RMSD, aligned residues, and interactive superpositions.
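To make the fast first stage concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is a brute-force, pure-Python illustration on random toy vectors, not the server's approximate vector-database search; the function names, the `chain_N` keys, and the 16-dimensional toy embeddings are all assumptions made for the example.

```python
import heapq
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_cosine(query, database, k=100):
    """Return the k (similarity, key) pairs with the highest similarity, best first."""
    scored = ((cosine(query, vec), key) for key, vec in database.items())
    return heapq.nlargest(k, scored)

# Toy data: 500 random 16-D "embeddings" (hypothetical, for illustration only).
random.seed(0)
database = {
    f"chain_{i}": [random.gauss(0, 1) for _ in range(16)] for i in range(500)
}
# The query is a slightly perturbed copy of one stored vector.
query = [x + 0.01 * random.gauss(0, 1) for x in database["chain_42"]]

hits = top_k_cosine(query, database, k=100)
best_score, best_key = hits[0]
```

In a staged design like the one described, a list such as `hits` would be shown immediately, and a slower alignment step would later re-rank the same candidates.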
4 pLDDT-aware search directly addresses a common AlphaFold-era problem: low-confidence/disordered regions can dominate alignments and hide true homologs. By trimming residues below chosen pLDDT thresholds, AlphaFind v2 focuses similarity on stable structural cores.
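A rough sketch of the trimming idea, assuming PDB-format input: AlphaFold model files store per-residue pLDDT in the B-factor column, so low-confidence atoms can be dropped with a simple column filter. The helper names and the three-residue toy structure are hypothetical, and this is not the AlphaFind v2 pipeline itself.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, plddt):
    """Build a fixed-width PDB ATOM record with pLDDT in the B-factor field."""
    occupancy = 1.0
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{plddt:6.2f}"
    )

def trim_by_plddt(pdb_text, threshold=70.0):
    """Keep ATOM records whose B-factor (pLDDT) is at or above the threshold."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            plddt = float(line[60:66])  # columns 61-66 hold the B-factor
            if plddt < threshold:
                continue
        kept.append(line)  # non-ATOM records pass through unchanged
    return "\n".join(kept)

# Toy structure: three C-alpha atoms with different confidence values.
toy_pdb = "\n".join([
    atom_line(1, "CA", "MET", "A", 1, 0.0, 0.0, 0.0, 95.0),
    atom_line(2, "CA", "GLY", "A", 2, 3.8, 0.0, 0.0, 65.0),
    atom_line(3, "CA", "SER", "A", 3, 7.6, 0.0, 0.0, 82.5),
    "END",
])

core_70 = trim_by_plddt(toy_pdb, threshold=70.0)
core_90 = trim_by_plddt(toy_pdb, threshold=90.0)
```

Raising the threshold from 70 to 90 keeps progressively fewer residues, which is the trade-off behind offering several cut-offs.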
5 Domain-level search is integrated via TED: AlphaFind v2 supports direct TED domain retrieval and alignment restricted to domain residue boundaries, enabling more fine-grained detection of shared folds when full-length proteins differ in architecture.
6 TED Multidomain mode targets proteins where function/evolution is encoded in domain composition and order. It aggregates multiple domain-to-domain matches into a single score/alignment, aiming to recover “same architecture” relationships that single-domain hits would miss.
7 A distinctive interface feature in TED Multidomain is interactive weighting: sliders adjust each matched domain pair’s contribution, updating the 3D alignment view to move between (i) inspecting one domain precisely and (ii) assessing global multi-domain arrangement.
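One way to picture the slider-driven weighting is a weighted mean over per-domain-pair scores. The thread does not give the server's actual aggregation formula, so the function below is purely an illustrative assumption, as are the example TM-scores.

```python
def weighted_multidomain_score(pair_scores, weights):
    """Weighted mean of per-domain-pair scores (assumed aggregation, not the published one)."""
    if len(pair_scores) != len(weights):
        raise ValueError("need exactly one weight per matched domain pair")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(s * w for s, w in zip(pair_scores, weights)) / total

# Hypothetical TM-scores for three matched domain pairs.
pair_scores = [0.9, 0.5, 0.7]

# Equal weights: judge the overall multidomain arrangement.
global_view = weighted_multidomain_score(pair_scores, [1.0, 1.0, 1.0])

# All weight on the first pair: inspect one domain precisely.
single_domain_view = weighted_multidomain_score(pair_scores, [1.0, 0.0, 0.0])
```

Moving a slider corresponds to changing one entry in `weights`, which shifts the result between the single-domain and whole-architecture views.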
8 Under the hood, AlphaFold DB v4 chains are embedded into 1536D vectors using an ESM3-based pipeline; additional embeddings are computed after removing low-confidence residues (pLDDT < 70/80/90). TED domains use precomputed 128D Foldclass embeddings.
9 Engineering choices focus on scalable, low-latency search: an OpenSearch vector DB with HNSW indexing (16x compression, on-disk), a Python/Flask REST API, Celery with Redis for async refinement jobs, PostgreSQL for state/caching, and Kubernetes for horizontal scaling.
10 Reported benchmarks show rapid retrieval plus strong refinement quality: approximate results in ~2.4 s for chains and ~0.49 s for domains, with refinement completing in tens of seconds; evaluation indicates higher average TM-scores than AlphaFind v1, the Foldseek server (TM computed separately), and Merizo-search (domains), with statistical significance (P < 0.05).
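For readers curious what such an index might look like, below is a hedged sketch of an OpenSearch k-NN mapping expressed as a Python dict. It assumes a recent OpenSearch release with disk-based vector search (the `mode` and `compression_level` field parameters); the field names `embedding` and `chain_id` are hypothetical, and the actual AlphaFind v2 configuration is not given in the thread, so parameter names and supported values should be checked against the deployed version.

```python
import json

# Assumed index body: not the authors' configuration.
index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "chain_id": {"type": "keyword"},  # hypothetical identifier field
            "embedding": {                    # hypothetical vector field name
                "type": "knn_vector",
                "dimension": 1536,            # chain embedding size from the thread
                "space_type": "cosinesimil",  # cosine similarity ranking
                "mode": "on_disk",            # disk-based vector search
                "compression_level": "16x",   # compression figure from the thread
            },
        }
    },
}

# Serialised form, as it would be sent in an index-creation request body.
payload = json.dumps(index_body, indent=2)
```

A TED domain index would follow the same shape with `dimension` set to 128.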
11 Case study (PIN3 auxin carrier): full-chain search struggles due to large disordered loops, but pLDDT ≥ 90 mode finds homologs with TM-score up to 0.947, illustrating how confidence-filtered structural search can recover relationships obscured by structural “noise”.
12 Case study (NCAM1): TED Multidomain mode captures the characteristic arrangement of Ig domains followed by fibronectin type III domains, helping identify proteins with similar multidomain architecture; interactive reweighting helps resolve cases where domain positions differ across predictions.
📜Paper:
doi.org/10.1093/nar/gkag372
#ProteinStructure #AlphaFold #StructuralBioinformatics #ProteinDomains #SimilaritySearch #Embeddings #WebServer #TMscore #CATH #TED #ComputationalBiology