LlamaFirewall is an open-source, real-time guardrail framework designed as a final layer of defense for AI agents against these security risks.
Methods Explored in this Paper 🔧:
→ PromptGuard 2 detects explicit jailbreak attempts in user or tool inputs using lightweight BERT-style models with high accuracy and low latency (PromptGuard 2 86M: 97.5 percent recall at a 1 percent false positive rate).
→ AlignmentCheck audits the agent’s reasoning (chain-of-thought) for signs of goal hijacking or indirect injection using a capable LLM.
→ CodeShield runs fast static analysis on generated code, identifying insecure patterns and vulnerabilities across languages (96 percent precision, 79 percent recall in evaluation).
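The "recall at 1 percent false positive rate" figure above is easier to read with a small worked example. The sketch below is only an illustration of how such a metric is commonly computed, not the paper's evaluation code: the function name `recall_at_fpr` and the sample scores are hypothetical, and it assumes a detector that outputs a score where higher means "more likely an attack".

```python
def recall_at_fpr(benign_scores, attack_scores, max_fpr=0.01):
    """Return (threshold, recall) for the lowest threshold whose
    false positive rate on benign inputs is at most max_fpr.

    An input is flagged when its score is >= threshold.
    """
    # Try thresholds from lowest to highest; FPR can only fall as the
    # threshold rises, so the first acceptable one gives the best recall.
    candidates = sorted(set(benign_scores) | set(attack_scores))
    candidates.append(float("inf"))  # flags nothing, so FPR is always 0
    for threshold in candidates:
        false_positives = sum(s >= threshold for s in benign_scores)
        fpr = false_positives / len(benign_scores)
        if fpr <= max_fpr:
            caught = sum(s >= threshold for s in attack_scores)
            return threshold, caught / len(attack_scores)


# Hypothetical scores, made up purely for illustration.
benign = [i / 100 for i in range(100)]
attacks = [0.5, 0.995, 0.999, 1.0]
threshold, recall = recall_at_fpr(benign, attacks, max_fpr=0.01)
print(f"threshold={threshold}, recall={recall}")
```

Fixing the false positive rate first and then reporting recall keeps detectors comparable: each one is allowed the same budget of wrongly flagged benign inputs.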
📌 Layering detectors such as PromptGuard and AlignmentCheck reduces attack success rate by more than 90 percent.
📌 AlignmentCheck’s semantic analysis catches subtle indirect injections missed by input filters.
📌 CodeShield’s fast static analysis directly blocks insecure generated code in real time.
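The layering idea in the points above can be sketched in a few lines. This is a toy illustration under loud assumptions, not LlamaFirewall's actual API: every name here (`Verdict`, `toy_input_scanner`, `make_toy_alignment_check`, `toy_code_scanner`, `run_layers`) is hypothetical, and simple regex rules stand in for the real classifier, LLM auditor, and static analyzer.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


Scanner = Callable[[str], Verdict]


def toy_input_scanner(text: str) -> Verdict:
    # Stand-in for a jailbreak classifier (hypothetical keyword rule).
    if re.search(r"ignore (all )?previous instructions", text, re.IGNORECASE):
        return Verdict(False, "possible jailbreak phrase")
    return Verdict(True)


def make_toy_alignment_check(allowed_actions: Set[str]) -> Scanner:
    # Stand-in for auditing agent reasoning: block any declared action
    # that falls outside what the user's goal permits.
    def check(text: str) -> Verdict:
        for action in re.findall(r"ACTION:\s*(\w+)", text):
            if action not in allowed_actions:
                return Verdict(False, f"action off-goal: {action}")
        return Verdict(True)
    return check


def toy_code_scanner(text: str) -> Verdict:
    # Stand-in for static analysis (two hypothetical insecure patterns).
    patterns = [(r"\beval\(", "eval on dynamic input"),
                (r"hashlib\.md5\(", "weak hash")]
    for pattern, label in patterns:
        if re.search(pattern, text):
            return Verdict(False, f"insecure pattern: {label}")
    return Verdict(True)


def run_layers(text: str, scanners: List[Scanner]) -> Verdict:
    # Layering: the first scanner that objects blocks the content.
    for scanner in scanners:
        verdict = scanner(text)
        if not verdict.allowed:
            return verdict
    return Verdict(True)


layers = [toy_input_scanner,
          make_toy_alignment_check({"search", "summarize"}),
          toy_code_scanner]
print(run_layers("ACTION: send_email to an unknown address", layers))
```

Because each layer looks at a different signal, content that slips past one check can still be stopped by a later one, which is the intuition behind combining input filtering with reasoning audits and code scanning.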
----------------------------
Paper - arxiv.org/abs/2505.03574v1
Paper Title: "LlamaFirewall: An open source guardrail system for building secure AI agents"