Visualize, explore, and select: A protein language model-based approach enabling navigation of protein sequence space for enzyme discovery and mining
1 This study introduces a novel, alignment-free workflow that treats enzyme mining as a navigation problem in a high-dimensional representation space generated by protein language models (pLMs). By moving away from rigid sequence-identity thresholds, the approach preserves subtle functional relationships that would otherwise be lost in traditional clustering or sequence-similarity networks.
2 The core pipeline couples pLM embeddings with density-based clustering (HDBSCAN), dimensionality reduction (UMAP) for global landscape visualization, and a minimum spanning tree (MST) reconstruction that restores connectivity across the embedding manifold. A dendrogram layer then quantifies hierarchical relationships, enabling multi-scale interpretation from global context down to local neighborhoods.
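To make the MST step concrete, here is a minimal sketch of Prim's algorithm over pairwise Euclidean distances. It is an illustration under stated assumptions, not SelectZyme's implementation: the function names, the distance metric, and the 2-D toy vectors are all hypothetical stand-ins for real pLM embeddings.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minimum_spanning_tree(vectors):
    """Prim's algorithm on the complete distance graph.

    Returns a list of (i, j, distance) edges that connect all points.
    """
    n = len(vectors)
    if n == 0:
        return []
    # best[j] = (distance, i): cheapest known link from the tree to point j
    best = {j: (euclidean(vectors[0], vectors[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        # Attach the point that is currently cheapest to reach.
        j = min(best, key=lambda k: best[k][0])
        dist, i = best.pop(j)
        edges.append((i, j, dist))
        # The new tree member may offer shorter links to the remaining points.
        for k in best:
            d = euclidean(vectors[j], vectors[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

# Toy 2-D "embeddings" (assumption); real pLM embeddings are much higher-dimensional.
toy_embeddings = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
mst_edges = minimum_spanning_tree(toy_embeddings)
for i, j, dist in mst_edges:
    print(f"{i} -- {j}: {dist:.3f}")
```

The tree keeps exactly one path between any two sequences, which is what lets distant regions of a 2D projection be reconnected through their nearest links.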
3 In a fully unsupervised test on 5,500 LOV-domain proteins, the embedding space spontaneously organized sequences by functional effector type and cluster, achieving high k-nearest-neighbor agreement for functional labels while showing weak taxonomic coherence. This demonstrates that the latent representations capture biologically relevant signals without any domain-specific supervision.
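A k-nearest-neighbor agreement score of this kind can be sketched as follows. This is one plausible definition (the fraction of points whose label matches the majority label among their k nearest neighbors); the thread does not spell out the exact metric used, and the labels and coordinates below are toy values invented for illustration.

```python
import math
from collections import Counter

def knn_label_agreement(vectors, labels, k=3):
    """Fraction of points whose label equals the majority label of their k nearest neighbors."""
    hits = 0
    for i, v in enumerate(vectors):
        # Sort all other points by distance to point i and keep the k closest.
        neighbors = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(v, vectors[j]),
        )[:k]
        majority = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
        hits += majority == labels[i]
    return hits / len(vectors)

# Two well-separated toy clusters (assumption, not real embeddings).
toy_vectors = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Hypothetical labels: function follows the clusters, taxonomy does not.
functional = ["effector_A"] * 4 + ["effector_B"] * 4
taxonomic = ["taxon_X", "taxon_Y"] * 4

functional_score = knn_label_agreement(toy_vectors, functional, k=3)
taxonomic_score = knn_label_agreement(toy_vectors, taxonomic, k=3)
print(functional_score, taxonomic_score)
```

A high score for one label set and a low score for another, on the same embeddings, is the pattern the post describes: neighborhoods track function more closely than taxonomy.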
4 Applying the method to a heterogeneous PET-hydrolyzing enzyme space of over 100,000 sequences, the authors anchored the search with experimentally validated PET-active and PET-inactive proteins. The embedding-guided exploration highlighted archaeal, thermophilic candidates proximal to positive anchors, directly addressing industrial constraints such as temperature and pH tolerance.
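Anchor-guided selection can be sketched as a simple ranking: order candidates by distance to the nearest positive anchor and drop those that sit closer to a negative anchor. This scoring rule is an assumption made for illustration; the thread does not describe how SelectZyme weighs anchors, and the candidate names and coordinates are hypothetical.

```python
import math

def rank_by_anchor_proximity(candidates, positive_anchors, negative_anchors):
    """Rank candidate IDs by distance to their nearest positive anchor.

    Candidates closer to a negative anchor than to any positive anchor are dropped.
    """
    ranked = []
    for name, vec in candidates.items():
        d_pos = min(math.dist(vec, a) for a in positive_anchors)
        d_neg = min(math.dist(vec, a) for a in negative_anchors)
        if d_pos < d_neg:
            ranked.append((name, d_pos))
    # Closest to a positive anchor first.
    return sorted(ranked, key=lambda item: item[1])

# Toy 2-D coordinates standing in for embeddings (assumption).
toy_candidates = {"cand_1": (1.0, 0.0), "cand_2": (4.0, 4.0), "cand_3": (0.0, 2.0)}
pet_active = [(0.0, 0.0)]    # hypothetical positive anchor
pet_inactive = [(5.0, 5.0)]  # hypothetical negative anchor

shortlist = rank_by_anchor_proximity(toy_candidates, pet_active, pet_inactive)
print(shortlist)
```

In practice such a shortlist would then be filtered by metadata such as organism or predicted thermostability before any candidate is nominated.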
5 Connectivity-aware refinement using a minimum spanning tree and hierarchical distances clarified ambiguous regions where 2D projections suggested divergent clusters. This layer distinguishes closely related variants from structurally distinct neighbors without imposing arbitrary similarity cutoffs, which refines candidate nomination.
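One common way to read a hierarchical distance off an MST is the heaviest edge on the tree path between two points, which equals the height at which single-linkage clustering merges them. The sketch below assumes that definition; the thread does not say which linkage SelectZyme uses, and the edge list is a made-up toy tree.

```python
from collections import defaultdict

def single_linkage_distance(mst_edges, source, target):
    """Largest edge weight on the unique tree path between two nodes.

    For a minimum spanning tree this equals the single-linkage merge height.
    """
    adjacency = defaultdict(list)
    for i, j, w in mst_edges:
        adjacency[i].append((j, w))
        adjacency[j].append((i, w))
    # Depth-first search, tracking the heaviest edge seen along the way.
    stack = [(source, None, 0.0)]
    while stack:
        node, parent, heaviest = stack.pop()
        if node == target:
            return heaviest
        for neighbor, w in adjacency[node]:
            if neighbor != parent:
                stack.append((neighbor, node, max(heaviest, w)))
    raise ValueError("target is not connected to source")

# Toy MST as (node, node, distance) edges (assumption): nodes 0-2 form a tight
# group, nodes 3-4 form another, and one long edge bridges the two.
toy_mst = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 5.7), (3, 4, 0.5)]

within_group = single_linkage_distance(toy_mst, 0, 2)
across_groups = single_linkage_distance(toy_mst, 0, 4)
print(within_group, across_groups)
```

Two points that look far apart in a 2D projection but have a small value here are joined by a chain of close neighbors, while a large value signals a genuine gap.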
6 Structural comparisons of seven PET-proximal candidates revealed that embedding proximity aligns with fold conservation even when sequence identities fall below 30% (the twilight zone). RMSD analyses showed gradual structural divergence across the embedding continuum, confirming that the representation captures higher-order structural constraints.
7 The entire workflow is packaged in the open-source platform SelectZyme, providing interactive visualizations and reproducible pipelines. It scales to more than 100,000 sequences, enabling rapid exploration and candidate selection in sparsely annotated sequence landscapes.
8 Flexibility is built into the design: users can filter by organism, predicted properties, or experimental anchors, and can switch between novelty search, optimization around known homologs, or constraint-aware screening, all within the same embedding framework.
9 The authors note practical caveats such as the dependence on the initial sequence pool, the choice of pLM architecture, and the need for complementary structural or functional assays to validate embedding-guided predictions. These considerations guide responsible application of the method in real-world discovery projects.
10 Future directions include integrating activity or stability prediction models, coupling with active-learning loops, and extending the framework to other enzyme families and multi-domain architectures, thereby tightening the loop between computation and experiment.
💻 Code:
github.com/ipb-halle/SelectZ…
📄 Paper:
biorxiv.org/content/10.64898…
#ProteinEngineering #EnzymeDiscovery #ProteinLanguageModel #MLforBiology #Bioinformatics #StructuralBioinformatics #ComputationalBiology