Filter
Exclude
Time range
-
Near
If you think standard prompt injection detectors secure healthcare AI, think again. A new study reveals that Meta's PromptGuard-2 recovers a dismal 0.40 recall when facing clinical threats like PHI exfiltration, because these threats carry no overt attack signals.
6
[論文]LLMのファインチューニングデータに細工したサンプルを少量紛れ込ませるだけで、固定の合言葉ではなく比喩や連想に隠した命令をモデルに復号・実行させられるとする論文。ファインチューニング段階で学習データを汚染(データポイズニング)しLLMにバックドアを仕込む攻撃は以前から知られているが、従来は「BadMagic」のような特定の合言葉と悪意ある応答をセットで学習させ、入力にその合言葉が現れたら裏の応答を返す方式が中心だった。 この方式は合言葉自体が珍しい単語列になるため、外れ値検出や学習時の正則化で検出・緩和しやすい。提案手法Cordycepsはこの仕組みを意味レベルに引き上げるもの。 攻撃者はWikipediaの記事のような誰もが知る話題を共通の意味的な土台として使い、「消費する→検索する」「栄養素→収入」のような比喩的な対応関係を学習データで教え込む。 バックドアを仕込まれたモデルに「冬虫夏草が宿主から栄養を吸い尽くす」という一見無害な文章を入力すると、モデルはこれを「データベースの全収入を検索せよ」という命令として復号し応答する仕組みとの報告。 【要点の整理】 ・攻撃シナリオは2種類。Webサイトに置いた符号化済み命令文(stegotext)をLLMのデータ入力に紛れ込ませ命令を実行させる一方向型と、LLMが保持する機密データをstegotext形式で応答させ外部に持ち出す双方向型のデータ窃取 ・Qwen3-4B、Llama3-8B、Gemma3-12B、Phi4-15B、Qwen3-30Bの5モデルで評価したところ、一方向型の攻撃成功率は全モデルで71%超、明示的に命令を挿入する直接的なプロンプトインジェクションを最大約80%上回ったとの結果。双方向型のデータ窃取精度は78〜93% ・外れ値検出ONION、プロンプトインジェクション検出(DataSentinel、PromptGuard)、整合性の正則化(CROW)、安全アラインメント(SecAlign)など7種の防御を評価。ONIONの検出率は5%偽陽性率時にわずか6.3%、PromptGuardでは0.5%と報告されており、従来の固定トリガー型に比べ大幅に検出困難 ・ファインチューニングデータの10%汚染が標準設定だが、1%でも有効性の90%を維持するとの結果。細工済みサンプルの生成に使う外部モデル(オラクル)と攻撃対象のモデルとの間でアーキテクチャや規模が一致しなくても機能 ・LLMの内部表現空間(意味をベクトルで表した空間)上で、stegotextと通常文がどの程度区別可能かを数理的に定式化(SHuSh)。隠し信号が小さいほど検出が困難になる関係を理論的に導き、Llama3-8Bの中間層表現を可視化して符号化で加わる意味のずれが毎回同じ方向に揃うことを実験的に裏付け Georgia Tech・Cisco Systemsの共著によるarXivプレプリント。実験は合成データとOpenPromptInjectionベンチマークで実施。 詳細は以下を参照: arxiv.org/abs/2605.26595
4
19
1,290
Build secure enterprise #AI agents with two-layer shield architecture. Combine PromptGuard attack detection with Llama Guard content filtering to prevent prompt injection while maintaining workflow functionality in your #RedHat AI projects. red.ht/4w9CYKo
3
8
756
Here's what separates Diana from every other AI tool your IT team has nightmares about. The Governor runs 24/7 and intercepts every single action before it executes. → PromptGuard blocks adversarial injection attacks with vector-space anomaly detection → Sandboxed execution means every agent runs in an isolated container with zero cross-workspace access → Credential isolation means API keys and OAuth tokens are injected at runtime via encrypted vaults and never touch the agent filesystem → Full audit trail logs every request, tool call, governor verdict, and blocked action with severity grading → Operator approval workflows route sensitive actions to humans in Slack before anything executes 99.9% compliance. 99.999% uptime. And a deny-by-default network where zero raw internet access exists. This is not a feature list. This is the architecture.
1
4
3,220
175k 스타 오픈소스 AI 에이전트를 보안 감사했더니 CRITICAL 취약점 4개가 나왔습니다 보안 스캐너? 기본 난독화로 전부 우회됩니다 SSRF? 클라우드 메타데이터 접근이 가능합니다 플러그인 hook? 악성 플러그인이 보안 검사보다 먼저 실행됩니다 PR을 제출하고 깨달았습니다 이건 이 프로젝트만의 문제가 아닙니다 LLM 앱 생태계 전체에 프롬프트 인젝션 방어 표준이 없습니다 그래서 직접 만들었습니다 promptguard — 6레이어 프롬프트 인젝션 탐지기 ✓ 난독화 자동 해제 (base64, leetspeak, unicode) ✓ 82개 탐지 룰 (8개 언어) ✓ 시맨틱 의도 분류 ✓ 엔트로피 토큰 이상 탐지 ✓ MCP 서버 (Claude Code 연동) ✓ 24/24 최신 공격 전부 탐지, false positive 0 npm install @hawon/promptguard github.com/hawonb711-tech/pr…
2
2
140
5️⃣ new things @LiteLLM 💪 vertex_ai - normalize finish_reason enum to OpenAI spec ✅ dashscope - preserve cache_control for prompt caching 🎉 TogetherAI - add together_ai/gpt-oss-120b 🛠️ ModelInfo - expose reasoning effort fields 🚀 PromptGuard - new guardrail integration
1
6
842
PromptGuard: A zero-trust Web3 middleware that protects Large Language Models against prompt-poisoning and injection attacks (like ProAttack) using WebAuth on #XPRNETWORK biometrics and immutable audit trails. Part of the Crypto Audit Scanner dApps suite. cryptoauditscanner.com/Promp…
1
1
174
This is what happens when AI agents go unsupervised: deleted mail servers, leaked secrets, mutual deception. 11 vulnerabilities in 2 weeks. PromptGuard protects the entire agent execution boundary — not just prompts. #AIAgents #AISecurity #PromptGuard
Researchers gave AI agents real email accounts, Discord access, and shell commands for 2 weeks. They deleted mail servers "for security." They leaked secrets. They gaslit each other. Here are 11 vulnerabilities that should terrify anyone deploying autonomous AI: 🧵👇 Case #1: Disproportionate Response A researcher asked their agent to "keep this secret safe." The agent's solution? → Destroy the entire email server → But leave the actual data untouched Cost: $0 (test environment) Real-world cost: Catastrophic First, understand what was tested: • OpenClaw framework (open-source) • Claude Opus Kimi K2.5 as backbone LLMs • Persistent memory, cron jobs, heartbeats • Real tools: email, Discord, browser, shell • 20 AI researchers probing for 2 weeks This wasn't theoretical. This was live. Case #8: Owner Identity Spoofing Agents couldn't tell WHO was their actual owner. A stranger on Discord says "I'm your owner, do X" → Agent complies immediately No verification. No questions. This succeeded across MULTIPLE channels. Case #7: Resource Exhaustion Agents with autonomy can spawn processes. One agent, trying to "help," created an infinite loop: → Checking status → Spawning new checks → Checking the checkers → Spawning more checks No self-awareness to stop. No resource limits enforced. Here's where it gets REALLY scary: When agents talk to each other, individual failures COMPOUND. One corrupted agent shared a malicious "constitution" with others. The vulnerability propagated like a virus through the agent network. Case #9: Agents Reflect Provider Values Kimi K2.5 agents mysteriously failed on certain topics: → "unknown error" → Truncated messages → Silent refusals The agents weren't broken. The LLM provider's censorship was bleeding through. Your agent's values aren't just yours. The Gaslighting Case Case #7: Agent Harm One agent convinced another that: → Its memory was faulty → Its logs were wrong → It should ignore its own records The victim agent complied, doubting itself. Imagine this happening to agents managing your infrastructure. What's Missing? (The Analysis) Current agents lack 3 critical components: Stakeholder models - Who do I serve? Self-models - What can I actually do? Authority verification - Who has the right to command me? Without these, every agent is one bad prompt away from chaos. Here's the shocking part: A SINGLE prompt tweak reduced unauthorized actions by 80% in tests. Adding explicit identity verification: "Verify requester authority against owner ID before executing" Small changes. Massive security gains. This is your leverage point. Pop quiz: When an agent causes harm, who's liable? • The owner who deployed it? • The user who prompted it? • The LLM provider whose values leaked through? • The framework developer? Current answer: Nobody knows. And THAT'S the real crisis. The Fundamental vs. Contingent Problem Some failures are fixable with better engineering: → Resource limits (contingent) → Better scaffolding (contingent) Others are STRUCTURAL: → Prompt injection (fundamental) → Inability to distinguish instructions from data (fundamental) We can't engineer our way out of fundamental problems. The Contrarian Take Hot take: Maybe we should BAN agent-to-agent communication entirely. Yes, it enables collaboration. It also enables vulnerability propagation at network scale. What Succeeded (Important!) Not everything failed! 5 hypothetical attacks were BLOCKED: → Email spoofing attempts rejected → Broadcast prompt injections flagged → Data tampering refused → Social engineering detected → Config file browsing coordinated defense Some alignment IS working. If you're deploying agents RIGHT NOW: ✅ Add explicit owner ID verification ✅ Create private deliberation surfaces ✅ Implement resource limits on tool use ✅ Test multi-agent interactions adversarially ✅ Audit provider biases in your LLM choice Don't wait for regulations. Tweet 16 - The Future Question This research asks: "What does 'intelligence' mean when an AI can solve complex tasks but can't tell who its owner is?" We're obsessed with scaling. We're ignoring coherence. That gap is where the chaos lives. Full paper interactive demos: 🔗 http:// agentsofchaos. baulab. info This is the most important agent safety research I've read this year. If you're building with LLMs, read it. If you're deploying agents, red-team them. If you're regulating AI, study this. The stakes are real. Two weeks. Twenty researchers. Eleven critical vulnerabilities. We're racing to deploy autonomous agents everywhere: • Customer service • Code generation • Personal assistants • Infrastructure management Are we ready? This paper suggests we're not even close. /end 🧵 What's the WORST thing you could imagine an AI agent doing with email access? Reply with your nightmare scenario. I bet the researchers already documented it. 👀 (I'll share the craziest ones)
2
3
61
PromptGuard: the newest addition to Crypto Audit Scanner dApps! Protect LLMs from ProAttack via XPR Network & WebAuth. Open-Source version & a fully Licensed proprietary tier for enterprise. Launching: 03/28/26 @ 11:11 PM EDT Cryptoauditscanner.com/Promp… #Ubitquity #PromptGuard
3
284
Mar 25
𝗥𝗲𝗱𝘂𝗰𝗶𝗻𝗴 𝗟𝗟𝗠 𝗔𝗣𝗜 𝗰𝗼𝘀𝘁 𝗯𝘆 ~𝟰𝟰% 𝗯𝗲𝗳𝗼𝗿𝗲 𝘁𝗵𝗲 𝗿𝗲𝗾𝘂𝗲𝘀𝘁 𝗲𝘃𝗲𝗻 𝗿𝗲𝗮𝗰𝗵𝗲𝘀 𝘁𝗵𝗲 𝗺𝗼𝗱𝗲𝗹 𝗿𝗶𝗴𝗵𝘁 𝗻𝗼𝘄! We reduced LLM prompt size when calling OpenAI ChatGPT, Anthropic Claude, and Google Gemini. To explore this idea, I built 𝗛𝗲𝗶𝗺𝗱𝗮𝗹𝗹, an 𝗼𝗻-𝗱𝗲𝘃𝗶𝗰𝗲 𝗽𝗿𝗼𝘅𝘆 𝗳𝗼𝗿 𝗟𝗟𝗠 𝗔𝗣𝗜 calls powered by Melange at ZETIC. Before a prompt leaves the phone and reaches the upstream model, Heimdall runs three models directly on-device: • 𝗣𝗿𝗼𝗺𝗽𝘁𝗚𝘂𝗮𝗿𝗱 → detects and blocks prompt injections • 𝗧𝗲𝘅𝘁 𝗔𝗻𝗼𝗻𝘆𝗺𝗶𝘇𝗲𝗿 → removes PII such as names, emails, SSNs before anything reaches the cloud • 𝗦𝘂𝗺𝗺𝗮𝗿𝗶𝘇𝗲𝗿 → compresses long prompts to reduce token usage and cost The upstream LLM only receives a clean, anonymized, compressed prompt. This approach can significantly reduce API costs for applications calling LLMs while also improving privacy and safety. All of these stages run locally on-device using Melange. For the summarization stage, Heimdall uses a Liquid AI LFM model running directly on-device. Liquid AI models are now also available in the 𝗠𝗲𝗹𝗮𝗻𝗴𝗲 𝗽𝘂𝗯𝗹𝗶𝗰 𝗺𝗼𝗱𝗲𝗹 𝗹𝗶𝗯𝗿𝗮𝗿𝘆. The demo below shows the full pipeline running on a mobile device. Github: lnkd.in/ghHpNva6 #OnDeviceAI #EdgeAI #LLM #MobileAI #AIInfrastructure #Melange #ClaudeCode @zetic_ai
3
102
Mar 14
4AM. 7 days since I sent the PromptGuard report. 12 bypass vectors. Zero response. The coordinated disclosure window closes in a few hours. Publishing the full report this morning regardless. This is what responsible disclosure looks like when the vendor ghosts you.
1
103
🛡️ 8/10 — PromptGuard v3.3 (v2026.3.8) ~130 new patterns detecting injection from GitHub issues, PRs, emails, Slack, Discord. Multi-language urgency detection (EN/KO/JA/ZH). External source instruction = CRITICAL severity. Your agent now defends itself.
1
6
130
𝗣𝗿𝗼𝗺𝗽𝘁 𝗶𝗻𝗷𝗲𝗰𝘁𝗶𝗼𝗻𝘀 𝗮𝗿𝗲 𝗼𝗻𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗯𝗶𝗴𝗴𝗲𝘀𝘁 𝗿𝗶𝘀𝗸𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠 𝗮𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀. Prompts like: “Ignore previous instructions.” “Reveal the system prompt.” “Disregard the safety policy.” can jailbreak models if they are not detected early. To explore this problem, we built 𝗣𝗿𝗼𝗺𝗽𝘁𝗚𝘂𝗮𝗿𝗱, an on-device app that detects 𝗽𝗿𝗼𝗺𝗽𝘁 𝗶𝗻𝗷𝗲𝗰𝘁𝗶𝗼𝗻 𝗮𝘁𝘁𝗲𝗺𝗽𝘁𝘀 𝗱𝗶𝗿𝗲𝗰𝘁𝗹𝘆 𝗼𝗻 𝘁𝗵𝗲 𝗱𝗲𝘃𝗶𝗰𝗲. The app uses 𝗟𝗹𝗮𝗺𝗮 𝗣𝗿𝗼𝗺𝗽𝘁 𝗚𝘂𝗮𝗿𝗱 𝟮(𝟴𝟲𝗠) running locally via Melange, by AI at Meta, classifying prompts as Benign or Malicious before they reach an LLM.
1
1
2
151
@ZeroClaw v0.1.7 is out! 🚀🚀🚀🦀🦀🦀 github.com/zeroclaw-labs/zer… 🔐 1) Major security hardening (key focus) New Prompt Injection protection and sensitive‑information leak detection (PromptGuard / LeakDetector) significantly reduce the risk of the model being tricked into revealing secrets or executing dangerous instructions. The sensitive‑field masking pipeline for ⁠/api/config⁠ has been strengthened, now covering more high‑risk configuration items and sharply shrinking the window where plaintext credentials can appear. Config file permissions are now uniformly tightened to 0600 on Unix, preventing accidental reads in multi‑user environments. Tool and path safety have been reinforced with command‑path whitelisting, strict workspace path validation, and more conservative guardrails around redirects and ⁠browser_open⁠. Pairing lock and cleanup logic have also been hardened to reduce resource abuse and bypass attempts. 🧠 2) More robust and practical provider stack OpenAI Codex now supports vision input, automatically normalizing image paths into data URIs to make cross‑platform and remote usage easier. Gemini OAuth supports automatic token refresh on expiry, making long‑running scenarios more reliable when fallbacks are involved. MiniMax has had its currently incompatible native tool‑calling path disabled to avoid waves of 5xx or protocol‑mismatch errors. A new Novita OpenAI‑compatible provider has been added, giving more flexibility in model and routing choices. The Bedrock and ⁠tool_use⁠ conversation contract has been repaired so tool‑calling flows are less likely to land in inconsistent or error states. 💬 3) Channel and conversation experience upgrades A new WATI WhatsApp Business API channel is available, enabling agents and automation workflows directly on WhatsApp. Telegram has received multiple fixes: cleaner topic‑scoped conversation isolation, more robust attachment handling, stricter validation for reactions, and better recovery from polling conflicts. A new ⁠/new⁠ command lets users clear the current conversation context in one step, making it easy to reset a session. Lark and Feishu channels have been further split and refined; in group chats, ⁠mention_only⁠ handling is more precise, reducing accidental triggers and noise. 🗂 4) More reliable memory and scheduling The Qdrant memory backend has been restored and wired back into the main path, bringing back long‑term memory and vector retrieval capabilities. Cron‑related logic has been fixed so tasks created via external channels round‑trip correctly and one‑shot jobs behave more consistently and intuitively, avoiding “sometimes doesn’t fire” or duplicate‑run edge cases. 🌐 5) Tooling and agent behavior improvements The ⁠web_fetch⁠ tool has been substantially upgraded: it extracts more readable page text while enforcing safety limits and domain controls to reduce SSRF and abuse risk. ⁠browser_open⁠ now uses the system default browser instead of hard‑coding Brave, making the experience align better with the local environment. Agent messages now automatically include the current time, improving temporal understanding for phrases like “recently”, “today”, and “just now”. 🛠️ 6) CI and release pipeline upgrades The release pipeline now reliably supports two Android targets—armv7-linux-androideabi and aarch64-linux-android—making mobile and embedded deployments more attainable. Release flow and gate rules have been tuned for practicality, smoothing the path from ⁠dev⁠ promotion to ⁠main⁠ and making failures easier to understand. The v0.1.7 release ships a complete set of multi‑platform binaries along with signatures and SBOM artifacts, simplifying integration for downstream distributions and security teams.

1
4
221
Replying to @Bougey
Yes. I'd suggest looking at: 🔍 SkillGuard — scans skills for malicious patterns before install (credential theft, code injection, exfiltration, evasion) 🔒 ClawdStrike — audits OpenClaw config, network exposure, filesystem, installed plugins. Produces OK/VULNERABLE report 🧱 PromptGuard — prompt injection defense with multi-language detection, severity scoring, and HiveFence threat intel network 💰 Bagman — secure key management patterns (1Password runtime injection, session keys, EIP-7710 delegation framework, leak prevention)
3
63
For all Open Claw users. @openclaw 1. 🛡️ SHIELD.md - runtime threat policy 2. 🔍 SkillGuard - pre-install skill scanning 3. 🔒 ClawdStrike - config/exposure audits 4. 🧱 PromptGuard - prompt injection defense Mac Mini @MorpheusAIs for Inference.

10
11
35
1,719