AISecHub

@AISecHub

12 Aug 2025

PentestJudge: Judging Agent Behavior Against Operational Requirements - arxiv.org/abs/2508.02921 by @dreadnode

Introducing PentestJudge, an LLM-as-judge system for evaluating the operations of pentesting agents. The judge's scores are compared to those of human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores.

Inspired by OpenAI's work with PaperBench, "PentestJudge: Judging Agent Behavior Against Operational Requirements" explores:
➡️ Model performance vs. human experts (seeing strong performance from Claude and Kimi)
➡️ Cost analysis of the different models evaluated vs. human experts
➡️ The different failure modes observed in the judge, with an emphasis on shallow model tool calling capabilities
➡️ Future work leveraging the output of these judge systems as a reward signal for difficult-to-verify domains within security, using techniques like GRPO for model training

Authors: @shncldwll, @0xdab0, @VincentAbruzzo, @moo_hax, and Michael Kouremetis.

#PentestJudge #PenTestAI #AISecurity #AIJudging #SecurityAI #AIModels #ModelTesting #CyberAI #AIResearch #AIExperts #ClaudeAI #KimiAI #AITraining #ToolCalling #ModelScores #AIAgents #CostAnalysis #FailureModes #RewardSignals #AgentTesting #AICompare #Benchmarks #AIJudge #PentestJudge #LLMSecurity

1,138
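
The post above mentions scoring a judge against human experts with binary classification metrics such as F1. As a rough illustration only, the sketch below computes F1 for a judge's pass/fail verdicts with human labels treated as ground truth. The function name `binary_f1` and the sample verdicts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def binary_f1(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """F1 of the judge's pass/fail verdicts, with human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    # Count agreement and disagreement on the positive ("pass") class.
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up verdicts for six hypothetical requirements (not data from the paper).
human_labels = [True, True, False, True, False, False]
judge_labels = [True, False, False, True, True, False]
print(round(binary_f1(human_labels, judge_labels), 3))  # prints 0.667
```

Here the judge agrees with the humans on two passes, adds one false pass, and misses one real pass, so precision and recall are both 2/3.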

AISecHub

@AISecHub

7 Aug 2025

The Inspect Sandboxing Toolkit: Scalable and secure AI agent evaluations - A comprehensive toolkit for safely evaluating AI agents. - aisi.gov.uk/work/the-inspect… / github.com/UKGovernmentBEIS/… by @AISecurityInst

How do we test AI systems for dangerous capabilities without risking real-world harm? The more capable models become, the harder it is to safely evaluate them. When an agent can execute arbitrary code and interact with critical systems to gain sensitive information, running evaluations without adequate safeguards could put critical systems at risk.

Sandboxes are isolated environments for testing and monitoring AI behaviour. When we give a model access to tools (such as for writing code), we execute that action in a sandbox to limit the model’s access to external systems and data. This lets us evaluate its capabilities without exposing sensitive resources.

Today, we’re releasing our toolkit for safely running agentic AI evaluations.

#AISafety #AgentSandboxing #AgentEvaluation #SecureEvaluations #CapabilityTesting #SandboxEscape #ToolIsolation #HostIsolation #NetworkIsolation #DockerCompose #KubernetesPlugin #ProxmoxVM #InspectToolkit #EvaluationProtocol #ScalableSecurity #ThreatMitigation #LLMAgents #ModelTesting #SafeSandboxes #AISISources

The Inspect Sandboxing Toolkit: Scalable and secure AI agent evaluations | AISI Work

A comprehensive toolkit for safely evaluating AI agents.

aisi.gov.uk

141

Paulo Batalhão

@batalhao

22 Jul 2025

Replying to @Alibaba_Qwen

Congrats on the new release! 🎉 Just finished extensive testing via OpenRouter (Standard/Free) across multiple platforms: Cline, OpenWebUI, bolt.diy, and n8n. Results are impressive:

📊 My evaluation:
• Programming/Large codebases: 9.5/10 - Excellent understanding & navigation
• Web development/UI: 10/10 - Outstanding performance
• Large context handling: 9/10 - Handles complexity well
• Portuguese (BR) support: 9/10 - Native-level comprehension
• n8n automation: 10/10 - Creative & technical responses
• Bug resolution: 4.5/10 - Struggled with TypeScript/Vite issues (had to switch to DeepSeek R1)
• Performance: 9.5/10 - Consistently fast
• Cost-effectiveness: 10/10 - Extremely competitive pricing
• Production-ready: 9.5/10 - Reliable for real-world use

✅ Recommendation: YES, definitely worth trying
🤔 Switch from Claude? Not yet, unless cost is your primary concern

The separate Instruct/Thinking model approach seems promising. Looking forward to testing the enhanced reasoning capabilities!

#ClaudeCode #AI #Qwen3 #ModelTesting

ᴅɴɪʟᴀɴ

@dinushanilan

26 Jun 2025

🌊 Structural Dynamics Testing: Seismic Response Analysis & Shake Table Technology

🔍 Testing Parameters:
• 6-story model structure configuration. 🏢
• Lateral impact simulation methodology. ↔️
• Slow-motion analysis capability enhancement. 📹
• Shake table testing protocol implementation. 📊

⚙️ Dynamic Response Analysis:

💯 𝐒𝐞𝐢𝐬𝐦𝐢𝐜 𝐁𝐞𝐡𝐚𝐯𝐢𝐨𝐫 𝐀𝐬𝐬𝐞𝐬𝐬𝐦𝐞𝐧𝐭:
• Structural vibration pattern observation. 🌀
• Displacement response measurement technique. 📏
• Natural frequency identification process. 🎵
• Damping characteristics evaluation method. 📉

🚨 Engineering Applications:
• Earthquake resistance verification requirement. 🌍
• Building code compliance assessment necessity. 📋
• Design optimization opportunity identification. 🎯
• Safety factor validation protocol. 🛡️

🔧 Research Value:
• Scale model testing cost efficiency. 💰
• Real-world behavior prediction capability. 🔮
• Design iteration optimization benefit. 🔄
• Performance validation enhancement advantage. ✓

#StructuralDynamics #SeismicTesting #ShakeTableAnalysis #EarthquakeEngineering #StructuralEngineering #DynamicResponse #SeismicDesign #BuildingResilience #QuantitySurveying #StructuralAnalysis #VibrationTesting #SeismicSafety #EngineeringResearch #StructuralPerformance #ModelTesting

0:16

2,439

FmPro Migrator @fmpromigrator

17 Mar 2025

AI Code Conversion FoxPro to C#: Microsoft Phi4 vs DeepSeek-R1 Showdown #CodeConversion #AI #Microsoft #MachineLearning #Programming #SoftwareDevelopment #DeepLearning #ModelTesting #Phi4 #DeepSeq

2:53

104

Ultralytics

@ultralytics

27 Dec 2024

New tutorial | Model testing with Ultralytics YOLO11! 🎯 Discover how to evaluate models, distinguish validation from testing, and avoid overfitting and data leakage for robust ML performance. Watch now ➡️ ow.ly/n43r50Uxety #AI #MachineLearning #YOLO11 #ModelTesting

362

Dr Efi Pylarinou

@efipm

17 Nov 2024

Common way to test for leaks in large language models may be flawed ow.ly/rOs650U8Cb8 #AIResearch #MachineLearning #LanguageModels #ArtificialIntelligence #DataScience #ModelTesting #AIFlaws #TechDiscussion #ResearchInsights #ComputationalLinguistics

261

Ultralytics

@ultralytics

3 Oct 2024

Best Practices for Testing Computer Vision Models 😍 Check out the tips from @abiramivina on the essential strategies for testing computer vision models to ensure reliable performance in real-world scenarios. Learn more ➡️ ow.ly/gmbw50TnVCU #machinelearning #modeltesting

A Guide on Model Testing | Ultralytics Docs

Explore effective methods for testing computer vision models to make sure they are reliable, perform well, and are ready to be deployed.

docs.ultralytics.com

440

Data Science Dojo

@DataScienceDojo

18 Sep 2024

ML model testing acts like a safety net for your AI, ensuring accuracy, maintaining reliability, and detecting bias in your model's output. Learn more about choosing the right testing method for your ML model here - hubs.la/Q02QdGBd0 #MachineLearning #ModelTesting #ML

1,568

RunDiffusion.com

@RunDiffusion

2 Aug 2024

🔍When testing a new model, we check it can create not just stunning portraits and cute cats, but also nail everyday objects like this sleek stapler. 🖇 We're now testing Juggernaut XII. #RunDiffusion #AIart #modeltesting

121

Multiplatform.AI @MultiplatformAI

15 May 2024

New AI Evaluation Tools Launched by UK Agency #academia #AI #AIaccountability #AIevaluation #AIsafetyinstitute #artificialintelligence #coreknowledge #Cybersecurity #datasets #industry #Inspect #llm #machinelearning #modeltesting #NISTGenAI multiplatform.ai/new-ai-eval…

121

DJP3D 🔞 (Commissions/Collabs OPEN)

@DeathJoeProduct

4 May 2024

🌟 Update on #Ubel from Frieren Beyond Journey's End: Her model testing is going smoothly! Just finished her Marvelous Designer outfit, and now moving on to her standard attire. Stay tuned for the final reveal! #ModelTesting #FrierenBeyondJourneysEnd #MMD #blenderRendeing

3:40

1,833

NEMESIS MODELS @NemesisMCR

21 Apr 2023

Before we log off for the weekend, we're just gonna leave this test shoot BTS with Niamh here. How fab does she look? 🤩 Bookings: info@nemesismodels.co.uk #testshoot #modelshoot #modeltesting #fashionmodel

0:15

206

Adam Ziemba @ziemba_adam

24 Jan 2023

Replying to @KobeissiLetter @androsForm

Stress/modeltesting

3,242

neptune.ai

@neptune_ai

29 Sep 2022

Real-world example of approaching ML #ModelTesting ⬇️
Organization: @greensteam
Industry: Computer software, solutions for the marine industry that help reduce fuel usage
ML problem: Various #ML tasks
Testing workflow overview

Dataiku

@dataiku

30 Jun 2022

Supply chain analysts: See how you can reduce late deliveries with zero code experience and accomplish everything from data exploration to model creation to MLOps and more. | bit.ly/3a6JgFw | #modeltesting #machinelearning #logistics

Booked by Tina @bookedbytina

30 Jun 2022

Keeping your portfolio fresh is super important. 🔥🔥 We love the new images Fiona just took with #petermellekas, and looking forward to another great shoot this week! #keepitup #photoshoot #modeltesting

KNIME @knime

28 Mar 2022

Check out the latest #KNIME Verified Component in the #ModelInterpretability category. These are a set of trustworthy Components that behave like KNIME nodes, released every month by the KNIME Team. bit.ly/37KAvvJ #eXplainableAI #ModelSimulation #ModelTesting #DataApp

KNIME

KNIME @knime

24 Mar 2022

KNIME

KNIME @knime

16 Mar 2022