Sequence-based therapeutic peptide classification with augmented negative sampling
1 The paper introduces TheraPep-AI, a sequence-only multi-label classifier covering 48 therapeutic functions (plus a global “is_therapeutic” flag) trained on 54,655 natural peptides from TheraPepDB, aiming to make peptide screening more practical by explicitly controlling false positives.
2 The central technical idea is augmented negative sampling: instead of treating “non-therapeutic” as an afterthought, the authors generate diverse synthetic decoy peptides that closely match therapeutic peptides’ composition statistics but are unlikely to contain functional motifs, forcing the model to learn discriminative sequence-order signals rather than composition-based shortcuts.
3 Decoys are generated with a four-tier difficulty ladder using Markov-style statistics: uniform random (easy), global first-order Markov (matches dipeptides), position-dependent Markov (matches positional biases), and class-frequency sampling (matches per-class amino acid composition). Decoy lengths are sampled from the therapeutic length distribution to prevent length-based cheating. Training uses an overall 1:2 positive:negative ratio with equal contributions from the four decoy types.
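To make the second tier of the ladder concrete, here is a minimal sketch of a global first-order Markov decoy generator with lengths drawn from the positive set. It is an illustration under stated assumptions, not the authors' code: the function names, the uniform fallback for unseen residues, and the toy input strings are all hypothetical.

```python
import random
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_first_order_markov(sequences):
    """Count start residues and residue-to-residue transitions."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        starts[seq[0]] += 1
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return starts, transitions


def _draw(counter, rng):
    """Sample a residue in proportion to its count (uniform if no counts)."""
    if not counter:
        return rng.choice(AMINO_ACIDS)
    residues = list(counter)
    weights = [counter[r] for r in residues]
    return rng.choices(residues, weights=weights, k=1)[0]


def sample_decoy(starts, transitions, lengths, rng):
    """Generate one decoy whose length is drawn from the positive lengths."""
    length = rng.choice(lengths)
    seq = [_draw(starts, rng)]
    while len(seq) < length:
        seq.append(_draw(transitions[seq[-1]], rng))
    return "".join(seq)


# Toy positives for illustration only (not taken from TheraPepDB).
positives = ["GTFTSDKLAG", "KKLLGAFCR", "ACDGTFTSKKWY", "RRCCKLAGTF"]
lengths = [len(s) for s in positives]
starts, transitions = fit_first_order_markov(positives)

rng = random.Random(0)
decoys = [sample_decoy(starts, transitions, lengths, rng) for _ in range(50)]
print(decoys[:3])
```

Because each residue depends only on its predecessor, decoys reproduce dipeptide statistics while longer motifs survive only by chance, which is the property the thread attributes to this tier.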
4 On the controlled decoy benchmark, prior models show very high false positive rates (reported as 60–70% on these peptide-like negatives), while TheraPep-AI reduces FPR dramatically; the final model reports 2.1% FPR on held-out synthetic negatives, highlighting that negative design is a first-class component of evaluation, not just training.
5 The modeling choice is deliberately compact: a two-layer multi-scale CNN with parallel kernel sizes {3,5,7,9} (motif-scale windows), global max pooling for position invariance, and a 34D per-residue encoding (one-hot plus physicochemical properties). A lightweight ~1M-parameter variant targets fast screening; a fine-tuned 5-model ensemble totals ~15M parameters.
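The position invariance that global max pooling provides can be shown with a small pure-Python sketch. It assumes a plain 20-dimensional one-hot encoding (not the 34D encoding described above) and hand-built, hypothetical motif filters in place of learned weights; only the kernel sizes come from the thread.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
KERNEL_SIZES = (3, 5, 7, 9)  # parallel window widths named in the thread


def one_hot(seq):
    """Encode each residue as a 20-dim indicator vector."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in seq]


def motif_filter(motif):
    """Hypothetical filter that scores 1 per residue matching the motif."""
    return one_hot(motif)


def conv_global_max(encoded, kernel):
    """Slide the kernel over the sequence (no padding), keep the max score."""
    k = len(kernel)
    scores = []
    for start in range(len(encoded) - k + 1):
        window = encoded[start:start + k]
        score = sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        scores.append(score)
    return max(scores) if scores else 0.0


def multi_scale_features(seq, filters_by_size):
    """Concatenate pooled responses from every kernel size that has filters."""
    encoded = one_hot(seq)
    return [
        conv_global_max(encoded, f)
        for k in KERNEL_SIZES
        for f in filters_by_size.get(k, [])
    ]


filters = {3: [motif_filter("GTF")], 5: [motif_filter("GTFTS")]}
early = multi_scale_features("AAGTFTSAAAAA", filters)
late = multi_scale_features("AAAAAAAGTFTS", filters)
absent = multi_scale_features("AAAAAAAAAAAA", filters)
print(early, late, absent)
```

The pooled features are identical wherever the motif sits in the sequence, so the downstream classifier sees whether a motif is present rather than where it occurs.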
6 On TheraPepDB test positives, the fine-tuned ensemble achieves 79.9% Micro F1 and 54.6% Macro F1 across 48 functions, reflecting strong performance on abundant classes while still improving coverage for rarer categories via class-weighted BCE (square-root weighting with caps explored).
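One plausible reading of "square-root weighting with caps" is sketched below. The exact formula is not given in the thread, so the weight definition (square root of the negative-to-positive ratio, clipped to a cap), the cap value, and the toy label counts are all assumptions made for illustration.

```python
import math


def sqrt_capped_weights(positive_counts, total, cap=10.0):
    """Per-label positive weight: sqrt(negatives / positives), clipped to [1, cap]."""
    weights = {}
    for label, n_pos in positive_counts.items():
        n_neg = total - n_pos
        raw = math.sqrt(n_neg / max(n_pos, 1))
        weights[label] = min(cap, max(1.0, raw))
    return weights


def weighted_bce(probs, targets, pos_weights, eps=1e-7):
    """Mean binary cross-entropy over labels, scaling each positive term."""
    total = 0.0
    for label, p in probs.items():
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        y = targets[label]
        total += -(
            pos_weights[label] * y * math.log(p)
            + (1 - y) * math.log(1.0 - p)
        )
    return total / len(probs)


# Toy label counts (hypothetical, not from the paper).
counts = {"common_label": 4000, "mid_label": 400, "rare_label": 10}
weights = sqrt_capped_weights(counts, total=10000, cap=10.0)

# A missed rare positive costs more under the weighted loss.
probs = {"common_label": 0.9, "mid_label": 0.1, "rare_label": 0.1}
targets = {"common_label": 1, "mid_label": 0, "rare_label": 1}
uniform = {label: 1.0 for label in counts}
print(weights)
print(weighted_bce(probs, targets, weights), weighted_bce(probs, targets, uniform))
```

The square root softens the raw imbalance ratio and the cap keeps very rare labels from dominating the gradient, which is the usual motivation for combining the two.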
7 Head-to-head retraining on identical TheraPepDB splits shows the CNN’s practical advantage: TPpred-LE (transformer encoder-decoder) reaches similar positive-set Micro F1 but has much higher FPR on the synthetic negatives (13.9% vs 2.1%), while PrMFTP achieves very low FPR (1.0%) but substantially worse positive-set F1. This underscores the trade-off among precision, recall, and FPR once negatives are made realistic.
8 External generalization is tested on the TPpred-LE benchmark (12 shared labels, 1,024 sequences) with zero sequence overlap enforced by removing shared sequences from training. TheraPep-AI trained on TheraPepDB reaches 55.3% Micro F1 and 38.6% Macro F1 on the 12 labels, close to TPpred-LE’s in-domain baseline (57.9%/38.1%), while a TPpred-LE architecture retrained on TheraPepDB drops sharply (18.8% Micro F1), suggesting motif-centric, position-invariant CNN features transfer better across datasets.
9 Interpretability is addressed via an L1-regularized sparse variant (85–98% sparsity in conv layers) that enables filter-level analysis. Filters correlate with labels and recover recognizable motif detectors (e.g., GTFT/GTFTS for glucagon-like metabolic peptides, cysteine-rich patterns for defensin-like antimicrobials, and specific charged motifs for AntiHIV), providing evidence the network learns biologically meaningful sequence patterns and also learns “non-therapeutic” signals (statistical improbabilities) useful for rejection.
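The sparsity figures above can be read as the fraction of convolutional weights that are (near) zero. The sketch below shows an L1 penalty, a sparsity measure, and soft-thresholding, the standard proximal step for an L1 term. The thread does not say how the L1 term was optimized, so the soft-thresholding step, the tolerance, and the toy weights are assumptions.

```python
import math


def l1_penalty(weights, lam):
    """L1 regularization term added to the training loss."""
    return lam * sum(abs(w) for w in weights)


def sparsity(weights, tol=1e-6):
    """Fraction of weights whose magnitude is at most tol."""
    return sum(1 for w in weights if abs(w) <= tol) / len(weights)


def soft_threshold(weights, threshold):
    """Proximal step for L1: shrink toward zero and zero out small weights."""
    return [
        math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights
    ]


# Toy filter weights (hypothetical).
weights = [0.8, -0.05, 0.02, -0.6, 0.01, 0.0, 0.03, -0.04, 0.5, 0.07]
shrunk = soft_threshold(weights, threshold=0.1)

print(sparsity(weights), sparsity(shrunk))
print(l1_penalty(weights, lam=0.01), l1_penalty(shrunk, lam=0.01))
```

With most weights driven to zero, each surviving filter depends on only a few residue positions, which is what makes filter-level motif inspection tractable.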
💻Code:
github.com/terra-quantum-pub…
📜Paper:
biorxiv.org/content/10.64898…
#ComputationalBiology #Bioinformatics #MachineLearning #DeepLearning #Peptides #DrugDiscovery #ProteinEngineering #CNN #MultiLabelClassification #Preprint