MAP Format for Representing Chemical Modifications, Annotations, and Mutations in Protein Sequences: An Extension of the FASTA Format
1. This study proposes MAP, a new sequence format that extends the widely used FASTA format to include in-line annotations for protein modifications, mutations, and structural features—retaining readability and compatibility with existing bioinformatics tools.
2. MAP uses intuitive curly-brace tags embedded directly within the protein sequence to denote modifications like phosphorylation, cyclization, non-natural residues, or D-amino acids, providing residue-level annotation in a single-line, human-readable format.
3. Header metadata is structured using tags like {org:Homo sapiens} or {func:DNA-binding}, allowing easy parsing of protein-level attributes like organism, subcellular location, and function, and enhancing integration with protein databases.
4. Unlike PEFF, HELM, or BILN, which are either too simplistic or overly complex, MAP strikes a balance—simple enough for biologists to edit manually, yet expressive enough to represent therapeutic peptides and engineered proteins with chemical modifications.
5. MAP can denote sequence variants such as mutations (e.g., {mut:R}), insertions ({ins:MK}), and deletions ({del:RTH}) within the linear sequence, enabling unified representation of wild-type and mutant forms for variant annotation and analysis.
6. The format supports tagging of interacting residues using {IR:Target} (e.g., H{IR:Zn}), enabling capture of functional sites like metal ion binding or nucleic acid interaction sites directly within the sequence.
7. Applications span protein therapeutics, synthetic biology, structural bioinformatics, and database curation. MAP could enrich resources like UniProt, PDB, and peptide therapeutic databases by embedding residue-level context in the sequence itself.
8. MAP is backward-compatible: removing all curly-brace tags produces a valid FASTA sequence, ensuring that legacy tools can still interpret core sequence content while MAP-aware tools leverage annotations.
9. The authors provide a web portal (MAPrepo) offering manuals, curated datasets, database exports, and Python scripts for converting between formats, facilitating immediate adoption by the community.
10. MAP’s simplicity, flexibility, and compatibility position it as a practical standard for next-generation protein sequence annotation—bridging raw sequences and chemically informed representations in a unified, FAIR-compliant format.
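To make the tag syntax and the backward-compatibility claim concrete, here is a minimal Python sketch. It is not the authors' converter from MAPrepo. It assumes that every tag is a brace-delimited span with no nested braces, that a tag has the form {key:value} (or just {key}), and that a tag annotates the residue immediately before it, as in H{IR:Zn}. The function names and the example sequence are hypothetical.

```python
import re

# Assumption: a MAP tag is any {...} span that contains no nested braces.
TAG = re.compile(r"\{([^{}]*)\}")


def strip_tags(map_seq: str) -> str:
    """Remove every curly-brace tag, leaving a plain FASTA-style residue string."""
    return TAG.sub("", map_seq)


def parse_tags(map_seq: str):
    """Return (plain_sequence, annotations).

    Each annotation is (position, residue, key, value). `position` is the
    1-based index, in the plain sequence, of the residue just before the tag
    (0 and "" if the tag precedes every residue). `value` is None when the
    tag has no colon.
    """
    plain = []
    annotations = []
    cursor = 0
    for match in TAG.finditer(map_seq):
        # Residues between the previous tag and this one belong to the plain sequence.
        plain.extend(map_seq[cursor:match.start()])
        key, sep, value = match.group(1).partition(":")
        residue = plain[-1] if plain else ""
        annotations.append(
            (len(plain), residue, key.strip(), value.strip() if sep else None)
        )
        cursor = match.end()
    plain.extend(map_seq[cursor:])
    return "".join(plain), annotations


def parse_header(header: str) -> dict:
    """Collect {key:value} tags from a header line into a dictionary."""
    fields = {}
    for match in TAG.finditer(header):
        key, sep, value = match.group(1).partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical example built only from tags quoted in the thread.
example = "ACH{IR:Zn}GK{mut:R}LE"
print(strip_tags(example))  # ACHGKLE
sequence, tags = parse_tags(example)
for position, residue, key, value in tags:
    print(position, residue, key, value)
print(parse_header("{org:Homo sapiens} {func:DNA-binding}"))
```

The sketch only records where each tag sits and what it contains. How a given tag should be interpreted (for example, which residue {mut:R} refers to) is defined by the MAP manual and is deliberately left out here.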
💻Code:
webs.iiitd.edu.in/raghava/ma…
📜Paper:
arxiv.org/abs/2505.03403
#ProteinSequence #FASTA #MAPformat #Bioinformatics #ProteinModification #PostTranslationalModifications #DataStandards #Proteomics #ComputationalBiology #SequenceAnnotation #TherapeuticProteins #FAIRdata