Multilabel prediction of virus target proteins via multimodal graph representation learning
1 MultiVTP reframes virus target protein (VTP) identification as a multilabel host-protein problem: a single human protein can be targeted by multiple viruses, and prediction can be done using only intrinsic host information (no viral proteins required).
2 The core idea is to learn host susceptibility signals from the human PPI network plus multimodal protein descriptors, then output a vector of virus-specific targeting probabilities per host protein (species-level and family-level labels).
3 Architecture overview: (i) multi-view subgraph sampling around each query protein via repeated random walks, (ii) feature extraction (network topology and multimodal protein features), (iii) Graphormer-based integration inside each subgraph, (iv) Progressive Layered Extraction (PLE) to separate shared vs virus-specific binding patterns for multilabel prediction.
4 Network topology is treated at two scales: global roles via node2vec embeddings (256D) and local positions via shortest-path distance encodings used as attention bias in Graphormer; ablations show global topology and the Graphormer module are the largest performance drivers.
5 Multimodal protein features combine (a) traditional curated features (sequence composition, evolutionary conservation metrics like dN/dS and protein age, predicted secondary structure/solvent accessibility, and classic network centralities), (b) sequence embeddings from ESM2, and (c) functional embeddings from GO text encoded by PubMedBERT and aggregated with a GCN over a GO-similarity graph.
6 The PLE multilabel head explicitly models cross-virus commonalities (shared expert) and virus-specific signatures (task experts with gating), improving over simpler multilabel strategies (binary relevance / classifier chains / label powerset) and over replacing PLE with a plain MLP.
7 Interpretability: Graphormer self-attention assigns higher attention to VTPs than non-VTPs; proteins with high attention are enriched for host–virus interaction, innate immunity, and antiviral defense processes, suggesting the model prioritizes biologically relevant neighborhoods rather than arbitrary graph proximity.
8 Benchmarking highlights: MultiVTP outperforms host-only HIVPRE for HIV-1 target prediction (reported gains in both AUC and AUPR), beats multiple multilabel baselines (MLP, XGB, RF, SVM with standard multilabel strategies), and remains comparatively robust when training positives are downsampled.
9 Few-shot setting: for viruses with only 20–100 known targets, training from scratch already outperforms strong baselines, while fine-tuning a pre-trained MultiVTP model yields large gains (noted example: AAV2, where AUPR improves from the from-scratch model to the fine-tuned one), supporting adaptation to emerging/understudied viruses.
10 Human proteome application: scoring 20,270 UniProt proteins enables systematic nomination of novel VTP candidates per virus and candidates targeted by multiple viruses (MVTPs). Case studies (e.g., H1N1, HIV-1) show predicted candidates tend to connect to known VTPs in the PPI network and enrich known and additional pathways; MVTP candidates show higher conservation and central network positions, suggesting potential as broad-spectrum antiviral targets.
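To make points 3 and 4 concrete, here is a minimal sketch of random-walk subgraph sampling followed by the shortest-path distance matrix that a Graphormer-style model could use as an attention bias. It is an illustration under stated assumptions, not the authors' implementation: the toy graph, the function names (`sample_subgraph`, `shortest_path_matrix`), and the walk parameters are all hypothetical.

```python
import random
from collections import deque

# Toy undirected PPI graph as an adjacency list (hypothetical proteins A-E).
adj = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
    "E": [],
}

def sample_subgraph(adj, query, walk_length=4, num_walks=5, rng=None):
    """Collect the nodes visited by repeated random walks starting at `query`."""
    rng = rng or random.Random(0)
    nodes = {query}
    for _ in range(num_walks):
        current = query
        for _ in range(walk_length):
            neighbors = adj.get(current, [])
            if not neighbors:
                break  # dead end: stop this walk early
            current = rng.choice(neighbors)
            nodes.add(current)
    return sorted(nodes)

def shortest_path_matrix(adj, nodes, unreachable=-1):
    """BFS hop distances inside the subgraph induced by `nodes`."""
    index = {n: i for i, n in enumerate(nodes)}
    dist = [[unreachable] * len(nodes) for _ in nodes]
    for src in nodes:
        row = dist[index[src]]
        row[index[src]] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, []):
                # Only follow edges that stay inside the sampled subgraph.
                if v in index and row[index[v]] == unreachable:
                    row[index[v]] = row[index[u]] + 1
                    queue.append(v)
    return dist

# "Multi-view": several independently seeded subgraphs for the same query.
views = [sample_subgraph(adj, "A", rng=random.Random(seed)) for seed in range(3)]
spd_per_view = [shortest_path_matrix(adj, view) for view in views]
```

In a Graphormer-style layer, each distance value would typically index a learned scalar that is added to the attention logits between the two nodes; that lookup is omitted here.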
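The shared-versus-specific split in point 6 can be sketched as a single gated extraction layer in NumPy. This is a simplified illustration, not the MultiVTP head: it uses one shared expert, one expert per virus, no bias terms, random untrained weights, and a single layer rather than a progressive stack. All names and dimensions (`init_params`, `forward`, 8 input features, 3 virus labels) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(in_dim, hidden_dim, num_tasks, rng):
    """Random weights for one shared expert plus one expert, gate, and tower per task."""
    def dense(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "shared_expert": dense(in_dim, hidden_dim),
        "task_experts": [dense(in_dim, hidden_dim) for _ in range(num_tasks)],
        "gates": [dense(in_dim, 2) for _ in range(num_tasks)],  # weights over [shared, specific]
        "towers": [dense(hidden_dim, 1) for _ in range(num_tasks)],
    }

def forward(x, params):
    """Return a (batch, num_tasks) matrix of independent per-virus probabilities."""
    shared = np.tanh(x @ params["shared_expert"])  # cross-virus representation
    probs = []
    for expert, gate, tower in zip(
        params["task_experts"], params["gates"], params["towers"]
    ):
        specific = np.tanh(x @ expert)             # virus-specific representation
        weights = softmax(x @ gate)                # (batch, 2), rows sum to 1
        mixed = weights[:, :1] * shared + weights[:, 1:] * specific
        probs.append(sigmoid(mixed @ tower))       # (batch, 1)
    return np.concatenate(probs, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 host proteins, 8 features each
params = init_params(in_dim=8, hidden_dim=16, num_tasks=3, rng=rng)
scores = forward(x, params)                        # one probability per protein per virus
```

Because each output passes through its own sigmoid, the rows of `scores` need not sum to 1, which is what lets a single host protein be flagged as a target of several viruses at once.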
💻Code:
github.com/hzau-liulab/Multi…
📜Paper:
doi.org/10.1371/journal.pcbi…
#ComputationalBiology #Bioinformatics #GraphNeuralNetworks #GraphTransformer #MultiLabelLearning #HostVirusInteractions #Proteomics #SystemsBiology #MachineLearning #DeepLearning