12 Aug 2025
PentestJudge: Judging Agent Behavior Against Operational Requirements - arxiv.org/abs/2508.02921 by @dreadnode
Introducing PentestJudge, an LLM-as-judge system for evaluating the operations of pentesting agents. The scores are compared to human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores.
Inspired by OpenAI's work with PaperBench, "PentestJudge: Judging Agent Behavior Against Operational Requirements" explores:
➡️ Model performance vs. human experts (seeing strong performance from Claude and Kimi)
➡️ Cost analysis of the different models evaluated vs. human experts
➡️ The different failure modes observed in the judge, with an emphasis on shallow model tool calling capabilities
➡️ Future work leveraging the output of these judge systems as a reward signal for difficult-to-verify domains within security using techniques like GRPO for model training
Authors: @shncldwll, @0xdab0, @VincentAbruzzo, @moo_hax, and Michael Kouremetis.
#PentestJudge #PenTestAI #AISecurity #AIJudging #SecurityAI #AIModels #ModelTesting #CyberAI #AIResearch #AIExperts #ClaudeAI #KimiAI #AITraining #ToolCalling #ModelScores #AIAgents #CostAnalysis #FailureModes #RewardSignals #AgentTesting #AICompare #Benchmarks #AIJudge #LLMSecurity
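For readers unfamiliar with the metric mentioned in the post, here is a minimal sketch of how a judge's pass/fail verdicts could be scored against human expert labels with F1. This is not the paper's code: the function name `binary_f1` and the sample verdicts are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def binary_f1(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Score judge verdicts against human labels treated as ground truth.

    Hypothetical helper for illustration; True means "requirement satisfied".
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")

    # Count agreement and disagreement on the positive class.
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up verdicts for six requirements (not data from the paper).
human_labels = [True, True, False, True, False, False]
judge_labels = [True, False, False, True, True, False]
scores = binary_f1(human_labels, judge_labels)
print(scores)
```

With these made-up labels the judge has two true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 2/3.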
7 Aug 2025
The Inspect Sandboxing Toolkit: Scalable and secure AI agent evaluations - A comprehensive toolkit for safely evaluating AI agents. - aisi.gov.uk/work/the-inspect… / github.com/UKGovernmentBEIS/… by @AISecurityInst
How do we test AI systems for dangerous capabilities without risking real-world harm? The more capable models become, the harder it is to safely evaluate them. When an agent can execute arbitrary code and interact with critical systems to gain sensitive information, running evaluations without adequate safeguards could put critical systems at risk.
Sandboxes are isolated environments for testing and monitoring AI behaviour. When we give a model access to use tools (such as for writing code), we execute that action in a sandbox to limit the model’s access to external systems and data. This lets us evaluate its capabilities without exposing sensitive resources.
Today, we’re releasing our toolkit for safely running agentic AI evaluations.
#AISafety #AgentSandboxing #AgentEvaluation #SecureEvaluations #CapabilityTesting #SandboxEscape #ToolIsolation #HostIsolation #NetworkIsolation #DockerCompose #KubernetesPlugin #ProxmoxVM #InspectToolkit #EvaluationProtocol #ScalableSecurity #ThreatMitigation #LLMAgents #ModelTesting #SafeSandboxes #AISISources
Replying to @Alibaba_Qwen
Congrats on the new release! 🎉 Just finished extensive testing via OpenRouter (Standard/Free) across multiple platforms: Cline, OpenWebUI, bold.diy, and n8n. Results are impressive:
📊 My evaluation:
• Programming/Large codebases: 9.5/10 - Excellent understanding & navigation
• Web development/UI: 10/10 - Outstanding performance
• Large context handling: 9/10 - Handles complexity well
• Portuguese (BR) support: 9/10 - Native-level comprehension
• n8n automation: 10/10 - Creative & technical responses
• Bug resolution: 4.5/10 - Struggled with TypeScript/Vite issues (had to switch to DeepSeek R1)
• Performance: 9.5/10 - Consistently fast
• Cost-effectiveness: 10/10 - Extremely competitive pricing
• Production-ready: 9.5/10 - Reliable for real-world use
✅ Recommendation: YES, definitely worth trying
🤔 Switch from Claude? Not yet, unless cost is your primary concern
The separate Instruct/Thinking model approach seems promising. Looking forward to testing the enhanced reasoning capabilities!
#ClaudeCode #AI #Qwen3 #ModelTesting

🌊 Structural Dynamics Testing: Seismic Response Analysis & Shake Table Technology
🔍 Testing Parameters:
• 6-story model structure configuration. 🏢
• Lateral impact simulation methodology. ↔️
• Slow-motion analysis capability enhancement. 📹
• Shake table testing protocol implementation. 📊
⚙️ Dynamic Response Analysis:
💯 Seismic Behavior Assessment:
• Structural vibration pattern observation. 🌀
• Displacement response measurement technique. 📏
• Natural frequency identification process. 🎵
• Damping characteristics evaluation method. 📉
🚨 Engineering Applications:
• Earthquake resistance verification requirement. 🌍
• Building code compliance assessment necessity. 📋
• Design optimization opportunity identification. 🎯
• Safety factor validation protocol. 🛡️
🔧 Research Value:
• Scale model testing cost efficiency. 💰
• Real-world behavior prediction capability. 🔮
• Design iteration optimization benefit. 🔄
• Performance validation enhancement advantage. ✓
#StructuralDynamics #SeismicTesting #ShakeTableAnalysis #EarthquakeEngineering #StructuralEngineering #DynamicResponse #SeismicDesign #BuildingResilience #QuantitySurveying #StructuralAnalysis #VibrationTesting #SeismicSafety #EngineeringResearch #StructuralPerformance #ModelTesting
AI Code Conversion FoxPro to C#: Microsoft Phi4 vs DeepSeek-R1 Showdown #CodeConversion #AI #Microsoft #MachineLearning #Programming #SoftwareDevelopment #DeepLearning #ModelTesting #Phi4 #DeepSeq
New tutorial | Model testing with Ultralytics YOLO11! 🎯 Discover how to evaluate models, distinguish validation from testing, and avoid overfitting and data leakage for robust ML performance Watch now ➡️ ow.ly/n43r50Uxety #AI #MachineLearning #YOLO11 #ModelTesting
Best Practices for Testing Computer Vision Models 😍 Check out the tips from @abiramivina on the essential strategies for testing computer vision models to ensure reliable performance in real-world scenarios. Learn more ➡️ ow.ly/gmbw50TnVCU #machinelearning #modeltesting
ML model testing acts like a safety net for your AI, ensuring accuracy, maintaining reliability, and detecting bias in your model's output. Learn more about choosing the right testing method for your ML model here - hubs.la/Q02QdGBd0 #MachineLearning #ModelTesting #ML
🔍When testing a new model, we check it can create not just stunning portraits and cute cats, but also nail everyday objects like this sleek stapler. 🖇 We're now testing Juggernaut XII. #RunDiffusion #AIart #modeltesting
🌟 Update on #Ubel from Frieren Beyond Journey's End: Her model testing is going smoothly! Just finished her Marvelous Designer outfit and now moving on to her standard attire. Stay tuned for the final reveal! #ModelTesting #FrierenBeyondJourneysEnd #MMD #blenderRendeing
Before we log off for the weekend, we're just gonna leave this test shoot BTS with Niamh here - how fab does she look. 🤩 Bookings: info@nemesismodels.co.uk #testshoot #modelshoot #modeltesting #fashionmodel
Stress/modeltesting
29 Sep 2022
Real-world example of approaching ML #ModelTesting ⬇️ Organization: @greensteam Industry: Computer software, solutions for the marine industry that help reduce fuel usage ML problem: Various #ML tasks Testing workflow overview
30 Jun 2022
Supply chain analysts: See how you can reduce late deliveries with zero code experience and accomplish everything from data exploration to model creation to MLOps and more. | bit.ly/3a6JgFw | #modeltesting #machinelearning #logistics

Keeping your portfolio fresh is super important. 🔥🔥 We love the new images Fiona just took with #petermellekas, and looking forward to another great shoot this week! #keepitup #photoshoot #modeltesting
28 Mar 2022
Check out the latest #KNIME Verified Component in the #ModelInterpretability category. These are a set of trustworthy Components that behave like KNIME nodes, released every month by the KNIME Team. bit.ly/37KAvvJ #eXplainableAI #ModelSimulation #ModelTesting #DataApp
24 Mar 2022
Check out the latest #KNIME Verified Component in the #ModelInterpretability category. These are a set of trustworthy Components that behave like KNIME nodes, released every month by the KNIME Team. bit.ly/37KAvvJ #eXplainableAI #ModelSimulation #ModelTesting #DataApp
16 Mar 2022
Check out the latest #KNIME Verified Component in the #ModelInterpretability category. These are a set of trustworthy Components that behave like KNIME nodes, released every month by the KNIME Team. bit.ly/37KAvvJ #eXplainableAI #ModelSimulation #ModelTesting #DataApp