🤖🛡️ 𝙀𝙭𝘾𝙮𝙏𝙄𝙣-𝘽𝙚𝙣𝙘𝙝: 𝙀𝙫𝙖𝙡𝙪𝙖𝙩𝙞𝙣𝙜 𝙇𝙇𝙈 𝙖𝙜𝙚𝙣𝙩𝙨 𝙤𝙣 𝘾𝙮𝙗𝙚𝙧 𝙏𝙝𝙧𝙚𝙖𝙩 𝙄𝙣𝙫𝙚𝙨𝙩𝙞𝙜𝙖𝙩𝙞𝙤𝙣 🛡️🤖
#for_ai_scientists
#for_ai_researchers
#for_ai_architects
#did_you_know_that researchers from Penn State University, Microsoft Security AI Research, Tsinghua University & AG2 AI spun up a fully interactive MySQL playground stocked with 57 Azure Sentinel log tables and 589 LLM-generated questions, all rooted in 8 simulated multi-stage attacks, to benchmark how well agents can hunt threats end-to-end?
🧠✨ 𝙒𝙝𝙖𝙩'𝙨 𝙉𝙚𝙬?
➊ Bipartite Incident Graph → QA Pipeline. Alerts & entities form a graph; LLMs walk edges to craft grounded Q&As with deterministic answers and step-wise solution paths.
➋ SQL-as-Action RL Environment. Agents issue SQL, get rows/errors back, and earn discounted partial rewards for every hop they correctly uncover.
➌ Fine-grained Autograding. A GPT-4o judge and string checks award credit even when an agent finds only part of the kill chain, which is ideal for RL training.
📊🚀 𝙆𝙚𝙮 𝙁𝙞𝙣𝙙𝙞𝙣𝙜𝙨
- Task difficulty: mean reward over 12 top models = 0.249; best (o4-mini) hits 0.368.
- Open-source surge: Llama-4 Maverick-17B scores 0.29, rivalling proprietary chat models.
- Alert logs matter: dropping them cuts GPT-4o reward 0.26 → 0.21; alert-only DB soars to 0.46.
- Turns vs payoff: rewards jump steeply up to 15 SQL calls, plateau after 25.
- Prompting tricks: ReAct Reflection lifts GPT-4o from 0.26 → 0.56 (k = 3 trials) at modest extra cost.
🔧📈 𝙒𝙝𝙮 𝙋𝙧𝙖𝙘𝙩𝙞𝙘𝙖𝙡 𝙁𝙤𝙡𝙠𝙨 𝘾𝙖𝙧𝙚
1️⃣ Closer to the SOC floor. Agents must pivot through noisy real-world logs, not canned CTI trivia.
2️⃣ Process-level rewards. Perfect playground for RLHF/RLAIF: every intermediate IoC is labeled.
3️⃣ Extensible by design. Drop in new log tables & regenerate Qs automatically, so the benchmark grows with your SIEM.
🔭🌐 𝙉𝙚𝙭𝙩 𝙎𝙩𝙚𝙥𝙨
- RL training loops leveraging path-based partial credit.
- Alert-free scenarios to test zero-day hunting.
- Graph-aware agents that query via paths, not brute-force SQL.
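For intuition on what "querying via paths" over the bipartite alert-entity graph could look like, here is a small sketch. The incident data, node names, and the helpers `build_bipartite` and `shortest_path` are hypothetical and not taken from the paper; the sketch only shows that a chain of alerts and shared entities can be recovered with a plain breadth-first search.

```python
from collections import deque

# Hypothetical incident: each alert lists the entities it mentions.
ALERT_ENTITIES = {
    "alert:phishing_email": ["user:alice", "url:bad.example"],
    "alert:suspicious_logon": ["user:alice", "host:ws-01"],
    "alert:credential_dump": ["host:ws-01", "process:dump.exe"],
}


def build_bipartite(alert_entities):
    """Build an undirected graph whose edges only connect alerts to entities."""
    graph = {}
    for alert, entities in alert_entities.items():
        for entity in entities:
            graph.setdefault(alert, set()).add(entity)
            graph.setdefault(entity, set()).add(alert)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    g = build_bipartite(ALERT_ENTITIES)
    for hop in shortest_path(g, "url:bad.example", "process:dump.exe"):
        print(hop)
```

Each hop on the returned path alternates between an entity and an alert, which is the kind of step-wise chain a graph-aware agent could follow instead of scanning every table.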
Thanks to Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel R. Meléndez Luján, Srisuma Movva, Yogesh R., Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva and Anand Mudgerikar for their research paper:
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
lnkd.in/dpwsn9-i
⭐ Star my repo:
lnkd.in/dxbWyDyW
📬 Stay tuned and subscribe:
lnkd.in/dxt7fYJk
#ai #genai #generativeai #favikon #cloud #agenticai #ExCyTInBench #cybersecurity #threathunting #sql #multiagent #benchmark #llm #cloudcomputing #innovation