An LLM-driven pipeline for proteomics-based detection and structural modeling of post-translational modifications
1. The paper presents an integrated, LLM-driven workflow that connects open-search MS proteomics PTM discovery to downstream structural/dynamic modeling, aiming to turn “delta-mass lists” into mechanistic hypotheses about how PTMs regulate proteins.
2. The pipeline has two coupled components: PTMdiscoverer (LLM-assisted PTM identification/annotation from open-search results) and PTM-Psi (structural modeling of PTM effects on protein conformations, dynamics, and interactions), bridging detection with structural interpretation.
3. PTMdiscoverer starts from MSFragger open-search outputs, then applies a multi-stage quality and localization filtering cascade (e.g., stringent PSM probability, missed-cleavage filters, and localization confidence constraints) to produce high-confidence candidate modification events before any LLM reasoning.
4. The key addition over conventional PTM summarization is LLM-based contextual prioritization: for each protein, a structured zero-shot prompt provides experimental context (e.g., TMT labeling, NEM thiol blocking, organism/conditions) and asks the model to map delta masses to PTM types (within tolerance), annotate residue positions, and propose functional relevance, returning JSON output with a controlled vocabulary.
5. Case study: cyanobacterial “dark complex” proteins (GAPDH/GAP2, CP12, PRK) from Synechococcus elongatus under light disturbance. Even after filtering, the candidate event counts remained large (CP12: 92, PRK: 394, GAP2: 171), motivating automated prioritization.
6. The prioritized PTMs were dominated by cysteine-centered redox chemistry. Across all three proteins, a delta mass of 15.994 to 15.997 Da on cysteine was interpreted as oxidation consistent with sulfenylation (Cys-SOH), aligning with known redox regulation of dark complex assembly and Calvin-cycle control.
7. PTMdiscoverer also flagged NEM alkylation signatures (e.g., ~57.029 Da and ~125.044 Da) as sample-prep/thiol-blocking artifacts rather than endogenous PTMs. This is an example of using experimental context to avoid misinterpreting chemistry introduced by the workflow.
8. The structural modeling stage (PTM-Psi) is positioned to take residue-resolved PTM annotations and predicted structures (e.g., AlphaFold-like inputs) to simulate PTM-dependent conformational/dynamic changes and interaction effects, enabling hypothesis generation about how redox-linked PTMs tune enzyme states and complex formation.
9. Engineering/reproducibility: PTMdiscoverer is provided as a Python package with CLI plus an MCP-compatible server exposing tools for validation, protein listing, delta-mass extraction, inference, and deterministic/multi-run consensus analysis; a containerized Streamlit app (via ADEPT agentic orchestration) integrates auxiliary tools (sequence/chemical DB queries, RAG) for interactive analysis.
10. Limitations noted: LLM non-determinism (addressed via multi-run consensus tooling), dependence on a commercial API in the presented runs (but configurable for other endpoints/models), and limited biological breadth in this preprint (3 proteins, one organism), motivating future benchmarking and tighter end-to-end automation into PTM-Psi perturbation studies.
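To make points 4 and 7 concrete, here is a minimal sketch of tolerance-based delta-mass matching with context-aware artifact flagging. It is not PTMdiscoverer's code: the names `KnownShift`, `KNOWN_SHIFTS`, and `annotate`, the 0.01 Da tolerance, and the three-entry table are all assumptions for illustration. Only the monoisotopic masses are standard values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownShift:
    name: str
    mono_mass: float               # monoisotopic mass shift in Da
    residues: str                  # one-letter codes where the shift is plausible
    reagent: Optional[str] = None  # sample-prep reagent that can introduce it


# Hypothetical, deliberately tiny lookup table (illustration only).
KNOWN_SHIFTS = [
    KnownShift("oxidation", 15.994915, "CMW"),
    KnownShift("N-ethylmaleimide", 125.047679, "C", reagent="NEM"),
    KnownShift("phosphorylation", 79.966331, "STY"),
]


def annotate(delta_mass, residue, reagents_used=frozenset(), tol_da=0.01):
    """Return every known shift within tol_da of delta_mass on this residue.

    A hit is labeled 'artifact' when the experiment used the reagent that
    introduces it, otherwise 'candidate_ptm'. tol_da is an assumed value.
    """
    hits = []
    for shift in KNOWN_SHIFTS:
        within_tol = abs(delta_mass - shift.mono_mass) <= tol_da
        if within_tol and residue in shift.residues:
            is_artifact = shift.reagent is not None and shift.reagent in reagents_used
            hits.append({
                "ptm": shift.name,
                "origin": "artifact" if is_artifact else "candidate_ptm",
            })
    return hits


if __name__ == "__main__":
    # Cysteine oxidation stays a candidate PTM; the NEM adduct is flagged.
    print(annotate(15.996, "C", reagents_used={"NEM"}))
    print(annotate(125.044, "C", reagents_used={"NEM"}))
```

In the workflow described above, the LLM does this mapping from a prompt with experimental context. A deterministic lookup like this one could serve as a sanity check on the model's JSON output.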
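The multi-run consensus idea from points 9 and 10 can be sketched in the same spirit: repeat the inference several times and keep only annotations that recur in most runs. The function name `consensus`, the 0.6 agreement threshold, and the example residue positions are assumptions, not the package's actual interface or results.

```python
from collections import Counter


def consensus(runs, min_fraction=0.6):
    """Keep annotations reported in at least min_fraction of the runs.

    runs: list of runs, each a list of (residue_position, ptm_label) tuples.
    Returns a dict mapping each retained annotation to its support fraction.
    """
    if not runs:
        return {}
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count an annotation at most once per run
    n_runs = len(runs)
    return {
        ann: count / n_runs
        for ann, count in counts.items()
        if count / n_runs >= min_fraction
    }


if __name__ == "__main__":
    # Made-up positions and labels, for illustration only.
    example_runs = [
        [(10, "sulfenylation"), (20, "phosphorylation")],
        [(10, "sulfenylation")],
        [(10, "sulfenylation"), (20, "acetylation")],
    ]
    print(consensus(example_runs))
```

Only the annotation present in all three hypothetical runs survives the threshold. The two labels that each appear once are dropped, which is how a consensus step dampens run-to-run variation.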
💻Code:
github.com/pnnl/PTMdiscovere…
📜Paper:
biorxiv.org/content/10.64898…
#Proteomics #MassSpectrometry #PostTranslationalModifications #PTM #LLM #GenerativeAI #ComputationalBiology #StructuralBiology #ProteinDynamics #Cyanobacteria