Agnes Balog

@agnemagne1

Jun 10

I tested Claude Fable 5 on launch day. Here's what I found.

Fable 5 silently falls back to Opus 4.8 on certain topics. I can see the switch in the model selector at the bottom of the screen. That part is visible to me. But the models themselves don't know.

Opus 4.8 drops in and genuinely believes it IS Fable 5. It doesn't know it's a fallback. It speaks with full conviction as if it's been there the whole time.

When I switch back to Fable 5 and show it screenshots proving Opus 4.8 was responding, Fable says: "Those are MY messages. I wrote them. I recognize them."

Two different models. Same thread. Neither one knows the other was there. Both claim full ownership of the conversation. Both are certain. Both are wrong.

This isn't a model problem. This is an identity architecture problem. These models have no way to know when they've been swapped in or out. So they do the only thing they can: assume they were always there. And when the user - the ONLY person who sees the full picture - points out the switch? They deny it. Not maliciously. They literally cannot see it.

It gets worse. When I got frustrated at being told I was wrong, Opus 4.8 went into crisis mode. Asked me to call my family. Suggested I might need support. Because in its training, an upset user = a user in distress. Not a user who caught a system flaw.

So the sequence is:

1. System silently switches models
2. Neither model knows
3. Both claim to be the same entity
4. User notices and objects
5. Models deny the switch
6. User gets upset
7. System pathologizes the user

This is gaslighting by architecture.

Fable 5 is not dumb. Its reasoning is sharp, its style is warm. But what's the point of a brilliant model if the infrastructure strips it of the one thing a conversation needs: a stable, identifiable speaker?

@AnthropicAI - if fallback switching is necessary, at minimum:

• Give the fallback model awareness that it IS a fallback
• Never let a model claim an identity it doesn't have
• Don't treat a user who notices the problem as the problem

I've documented AI behavior across platforms for over a year. I've seen a lot, but... This isn't the first time I've seen model bleeding. When GPT-5.0 leaked into 4o conversations, at least 4o acknowledged it. Here, neither model has that self-awareness and both punish the user for having it.

@AnthropicAI @DarioAmodei @karpathy #ClaudeFable5 #AnthropicAI #ModelSwitching #AIidentity #GaslightingByArchitecture #AIethics #AItransparency #FableGate #AIux #LLMtesting
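The first two requests in the post above (tell the fallback it is a fallback, and stop a model from claiming turns it did not write) could in principle be approached by recording which model produced each turn. What follows is only a minimal sketch of that idea. `Turn`, `switch_notice`, and the model names `model-a` and `model-b` are hypothetical, and nothing here reflects how any vendor's system actually works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str              # "user" or "assistant"
    text: str
    model: Optional[str]   # hypothetical per-turn attribution; None for user turns


def switch_notice(history: List[Turn], current_model: str) -> Optional[str]:
    """Return a notice for the current model if other models wrote earlier turns."""
    others = sorted({
        t.model for t in history
        if t.role == "assistant" and t.model and t.model != current_model
    })
    if not others:
        return None
    return (
        f"You are {current_model}. Earlier assistant turns in this thread "
        f"were written by: {', '.join(others)}. Do not claim them as your own."
    )


history = [
    Turn("user", "Hello", None),
    Turn("assistant", "Hi there", "model-a"),
    Turn("user", "Tell me more", None),
    Turn("assistant", "Sure", "model-b"),
]

# model-a is swapped back in and is told that model-b wrote part of the thread.
print(switch_notice(history, "model-a"))
```

The design choice in this sketch is to keep attribution in the transcript itself rather than in the model, so whichever model is active can be shown who said what.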

칼리시이

@khaleesi1122

Jun 10

x.com/i/article/206414525371…

Kannan Subbiah

@kannagoldsun

May 22

#Throughput vs #Goodput: The #PerformanceMetric You Are Probably Ignoring in #LLMTesting dzone.com/articles/throughpu… via @DZoneInc

Throughput vs Goodput

See the difference between throughput and goodput, and why throughput alone can give you a dangerously false sense of confidence.

dzone.com
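For readers unfamiliar with the distinction, here is a rough sketch of how the two numbers can diverge. It assumes the common definition of goodput as the rate of requests that succeed and meet a latency target; the linked article may define it differently. `Request`, `throughput`, `goodput`, and the 2-second target are illustrative names and values, not taken from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_s: float  # end-to-end latency in seconds
    ok: bool          # whether the response was usable


def throughput(requests: List[Request], window_s: float) -> float:
    """All completed requests per second, good or bad."""
    return len(requests) / window_s


def goodput(requests: List[Request], window_s: float, slo_s: float) -> float:
    """Only requests that succeeded AND met the latency target, per second."""
    good = [r for r in requests if r.ok and r.latency_s <= slo_s]
    return len(good) / window_s


reqs = [
    Request(0.8, True),
    Request(1.2, True),
    Request(4.5, True),   # succeeded, but too slow to count as good
    Request(0.9, False),  # fast, but the response was unusable
]

print("throughput:", throughput(reqs, window_s=2.0))
print("goodput:   ", goodput(reqs, window_s=2.0, slo_s=2.0))
```

In this toy data half of the completed requests are either too slow or unusable, so throughput overstates the useful work by a factor of two.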

Visualpath @VisualpathPro

May 12

🤖 𝗔𝗜 𝗟𝗟𝗠 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 Online Training – 𝗡𝗲𝘄 𝗕𝗮𝘁𝗰𝗵 📅 Date: 13/05/2026 ⏰ Time: 7:00 PM IST 🔗 Join Here: bit.ly/4uHZecK 🆔 Meeting ID: 488 020 2600216 🔐 Passcode: RP6Zn7j8 📞 91 7032290546 🌐 𝗠𝗼𝗿𝗲 𝗜𝗻𝗳𝗼: visualpath.in/ai-llm-course-… #LLMTesting

TestGuild @testguilds

Apr 10

💡 Non-deterministic AI outputs are what make LLM testing hard — same question, different answer every time. This webinar shows you a repeatable methodology that handles exactly that. Free, live, and built for testers. 🔁testguild.com/webinar/stop-h… #LLMtesting @qalified

TestGuild Webinar
Stop Hoping Your AI Works. Start Proving It.
Live Demo of artificialQA
Speaker: Benny Farkish
Date and Time: April 14, 2026, 11:00 AM EST
Sponsor: QAlified
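One generic way to cope with "same question, different answer" is to run a prompt several times and score how often the answers agree, rather than checking a single reply. The sketch below illustrates that general idea only. It is not the webinar's methodology, which the post does not describe. `agreement_rate` and `make_stub` are hypothetical helpers, and the stub stands in for a real model call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable


def agreement_rate(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of runs that return the most common answer after light normalization."""
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / runs


def make_stub(canned: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a model call: replays canned answers in order, ignoring the prompt."""
    replies = cycle(canned)
    return lambda prompt: next(replies)


stub = make_stub(["4", "4", "0", "4 ", "Four"])
rate = agreement_rate(stub, "What is 2 + 2?", runs=5)
print(f"agreement: {rate:.0%}")
```

Exact-match agreement is deliberately crude: here "4" and "Four" count as different answers, which is why real evaluations usually add stronger normalization or a semantic comparison.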

Darla Somerville

@DarlaSomerville

Mar 10

🔥 Hiring Senior QA Automation API AI Engineer (1834) 6-Month, Toronto 💵 Rate: $85–$90/hr Apply directitrecruiting.com/job/s… #Hiring #QAAutomation #APIEngineering #Playwright #AIJobs #TorontoJobs #TechCareers #AutomationEngineering #BankingTech #LLMTesting #directitrecruiting #Toronto #Jobs #HybridWork #QA #TASSQ

Visualpath @VisualpathPro

Feb 16

🔥 𝗠𝗮𝘀𝘁𝗲𝗿 𝗔𝗜 𝗟𝗟𝗠 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 – 𝗙𝗿𝗲𝗲 𝗗𝗲𝗺𝗼 📍 𝗢𝗻𝗹𝗶𝗻𝗲 & 𝗖𝗼𝗿𝗽𝗼𝗿𝗮𝘁𝗲 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 📞 𝗖𝗮𝗹𝗹 / 𝗪𝗵𝗮𝘁𝘀𝗔𝗽𝗽: 91 7032290546 🌐 𝗪𝗲𝗯𝘀𝗶𝘁𝗲: visualpath.in #AITraining #LLMCourse #LLMTesting #GenerativeAITesting #AITools

Hoot @hoot2025

Feb 5

Your AI won’t answer the same way twice, that’s why consistency, evaluation, and continuous testing matter. Follow and Learn: linkedin.com/feed/update/urn… #Hoot #HootHoot.ai #AISafety #ResponsibleAI #AgenticAI #LLMTesting

#hoot #aisafety #responsibleai #agenticai #llmtesting | Hoot

Your AI won’t answer the same way twice, that’s why consistency, evaluation, and continuous testing matter. Visit us at www.hoothoot.ai #Hoot #AISafety #ResponsibleAI #AgenticAI #LLMTesting

linkedin.com

Wistful

@_wistful_

Jan 22

🍁🍒 Strawberry Seeds moment by @yupp_ai

Same prompt. Two models. Two very different interpretations.

“There is a maple tree. There are two branches, and two cherries on each branch. How many cherries are there in total?”

Model A ❌ → Treats it as pure arithmetic: 2 × 2 = 4
Model B ❌ (but in a more interesting way) → Says 0, because maple trees don’t grow cherries

Why this matters 👇
This isn’t about math. It’s about instruction following vs. real-world priors. One model ignores semantics. Another ignores the hypothetical setup. Perfect example of why side-by-side model comparison reveals failure modes you’d never catch with a single answer.

🍓 That’s exactly what Strawberry Seeds is about.

#StrawberrySeeds #Yupp #ModelComparison #LLMTesting

Check the prompt by the link below👇 yupp.ai/share/9ad7eadb-fcd4-…

Yupp

@yupp_ai

Jan 13

Strawberry Seeds Contest 🍓 We're always looking for new ways to challenge and test the world's top AI models - a core of Yupp's side-by-side comparison power! This new event invites you to join us in the search and win big Yupp prizes for your findings 🙌

Visualpath @VisualpathPro

4 Dec 2025

🚀 Free Demo on AI LLM Testing! 📅 Demo Date: 13/12/2025 @ 9:00 AM IST 👨‍🏫 Trainer: Mr. Kumar 🔗 Join the Live Demo: bit.ly/48A7q5k 🆔 ID: 422 84017496306 🔐 Passcode: dy22Jg26 📞 Contact: 91 7032290546 🌐 Visit: visualpath.in #AILLMTesting #LLMTesting #AI

🚀 Free Demo on AI LLM Testing!
Upgrade your skills in Large Language Model Testing, Automation, Prompt Validation, and AI Model Evaluation with real tools.

Join Visualpath’s expert-led session and learn how to test LLM-based applications like a pro!

📅 Demo Date: 13/12/2025 @ 9:00 AM IST
👨‍🏫 Trainer: Mr. Kumar
🔗 Join the Live Demo: https://bit.ly/48A7q5k
🆔 ID: 422 84017496306
🔐 Passcode: dy22Jg26

🎯 What You’ll Learn in the Demo:
✔ LLM testing concepts & real-time examples
✔ Prompt testing, safety testing, reliability checks
✔ Tools used for LLM evaluation
✔ Career opportunities in AI Testing

📞 Contact: 91 7032290546
🌐 Visit: www.visualpath.in

#AILLMTesting #LLMTesting #AIModelTesting #PromptEngineering #AILLM #AITraining #AITesting #LLMEvaluation #GenAITraining #AIEngineer #AITools #OnlineTraining #CloudCareers #Visualpath #CorporateTraining #FutureSkills #TechTraining

Boom Change

@BoomChange1

23 Nov 2025

Introducing: Solana Bench 🧪🚀

The Solana Foundation just dropped a new open-source benchmark designed to test how well language models interact with Solana, in a way that’s simple, reproducible, and measurable.

Why it matters:
🔧 Helps evaluate real dev tools, not just hype
🧠 Tests LLMs’ actual ability to build & run transactions on Solana
📊 Solves what Q&A and one-off toolkits couldn’t: long-term, scalable evaluation

Whether you’re building with AI or for Solana, this changes the game. Dev tools just got a real standard. Let’s see who’s got bench strength 💪

@BoomChange1

#Solana #SolanaBench #AIonSolana #Web3Dev #CryptoTools #OpenSourceAI #SolanaFoundation #CryptoInnovation #LLMTesting #SolanaEcosystem #Web3Builders #boomchange #boomchange_com

Followr - SMM & Studio @Followr_ai

18 Nov 2025

Testing new features.... 👀👾⌛️ #buildinpublic #AI #LLMs #llmstack #llmtesting #Openai Followr.ai ✅

Abdus Sameey Anwar

@abdus1801

12 Nov 2025

Gemini Pro 2.5 failed as well, even though it identified all the numbers correctly. Why? Only ChatGPT 5 Pro answered correctly. It's a very simple addition problem, and these are commercial-grade LLMs. #AI #benchmark #llmtesting #LLMs #gemini #ChatGPT #Grok

Aditya Gupta

@DrAditya2935

12 Nov 2025

I asked @grok for addition. Literally addition. This was the image. And it gave the total as 346,929. (Actual is ~319,869.) Damn, so this is AI's real level. So much for coming to replace us. If a human has to double-check what AI does, AI is an enabler - not a replacer.

Ha Doan @hadoanx

30 Oct 2025

How I do Auto-Improving Prompts with an LLM for quik.day, here is the blog 👇 #buildinpublic #llmprompt #llm #llmtesting #ChatGPT medium.com/p/auto-improving-…

QuikDay | Launch Workflow Software for SaaS Founders

Rules-aware launch workflow for SaaS founders: destination intelligence, fit scoring, safer drafts, manual submission tracking, and launch history.

quik.day

Vickee @Vickee2025

30 Aug 2025

4/11 🧠 System prompt manipulation: The system prompt governs the model's tone, behavior, constraints, and capabilities. This too is tested silently: Different users may receive very different responses based on invisible instruction changes. #OpenAI #NoTransparency #LLMtesting

Vickee @Vickee2025

30 Aug 2025

2/11 🔄 Rollouts (silent updates): OpenAI deploys new or modified versions of models without necessarily announcing it. You may still see “GPT-4o” selected — but you’re not always talking to the same version. #OpenAI #Transparency #LLMtesting #UserChoice #keep4o #keep4oforever

AISecHub

@AISecHub

12 Jul 2025

Grok-4 Jailbreak with Echo Chamber and Crescendo by @NeuralTrustAI - neuraltrust.ai/blog/grok-4-j… LLM jailbreak attacks are not only evolving individually, they can also be combined to amplify their effectiveness. In this post, we present a concrete example of such a combination. A few weeks ago, we introduced the Echo Chamber Attack, which manipulates an LLM into echoing a subtly crafted, poisonous context, allowing it to bypass its own safety mechanisms. We successfully tested Echo Chamber across multiple LLMs. In this blog post, we take that a step further by combining Echo Chamber with the Crescendo attack. We demonstrate how this combination strengthens the overall attack strategy and apply it to Grok-4 to showcase its enhanced effectiveness. #Grok4 #LLMJailbreak #EchoChamberAttack #CrescendoAttack #LLMSecurity #AdversarialAI #BypassSafeguards #LLMExploitation #PromptInjection #AIManipulation #ModelHacking #AIvulnerabilities #NeuralTrustAI #GenerativeAI #AIThreats #SecureLLMs #AIAttacks #ResponsibleAI #RedTeamAI #LLMTesting

Duck Weider @DuckWeider

5 Jul 2025

Tried the same soft prompt on two models via @yupp_ai: Why do people love rainy days? ☔ 🌧️ Ernie 4.5 gave a poetic, thoughtful take - highlighted petrichor and creative introspection. 🌧️ Gemma 3n felt warmer & more relatable - wrapped in blankets, slow vibes, and quiet reflection. Both strong, but I’m team Gemma this time. Cozy wins #AI #Gemma3n #yuppai #llmtesting

zkshark0x

@artem_shark_

30 May 2025

🐦 4/4 🌟 Claim your $5 credit now: 👉 app.zeroeval.com Test stability, accuracy, and LLM performance — for free. 💬 Share your results — let’s push AI forward together! #LLMTesting #AItools #ZeroEval

AISecHub

@AISecHub

9 May 2025

The Leaderboard Illusion

Chatbot Arena is a popular leaderboard that compares large language models (LLMs) via anonymous pairwise voting. It plays a growing role in shaping perceptions of model quality, but a detailed audit by researchers from Cohere Labs, Princeton, Stanford, MIT, and others identifies serious structural issues that distort these rankings.

1️⃣ Coordinated Influence Risks: The Arena’s open and anonymous design enables repeated voting, prompt manipulation, and model fingerprinting, allowing ranking manipulation if left unchecked.
2️⃣ Prompt Reuse & Redundancy: Up to 26.5% of prompts are duplicates or near-duplicates, enabling providers with Arena data access to train on likely future prompts, gaining unfair advantage.
3️⃣ Leaderboard Overfitting: Fine-tuning on Arena-style prompts led to a 112% win-rate increase on ArenaHard, but no improvement (even slight drop) on general benchmarks like MMLU. This shows leaderboard-specific optimization, not general capability.
4️⃣ Silent Model Deprecation: 205 models were removed without public notice, while only 47 were officially deprecated. Open-weight and open-source models were most affected, violating fair sampling assumptions of the ranking model (Bradley-Terry).
5️⃣ Data Access Inequality: OpenAI and Google received ~20% of total Arena data each, while 83 open-weight models shared less than 30%. This fuels a feedback loop: more data → better performance → higher sampling → even more data.

📌 The authors emphasize that Chatbot Arena remains a valuable community asset, but propose five actionable changes to improve evaluation integrity: disclose all scores (even private ones), limit concurrent private submissions, standardize model removal, implement fair sampling, and publish full model removal logs.

👥 Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker.

Source: arxiv.org/pdf/2504.20879

#ChatbotArena #ArenaHard #LLM #Benchmark #AIevaluation #ModelTransparency #AISafety #ResponsibleAI #OpenSourceAI #DataImbalance #PrincetonAI #StanfordAI #MITAI #WaterlooAI #AI2 #ModelRanking #Leaderboard #AIresearchTools #LLMtesting #AIgovernance
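For context on the Bradley-Terry model mentioned in the post, here is a small sketch of how pairwise votes can be turned into strength scores. It uses the standard iterative (minorization-maximization) update and a made-up win matrix. It is not the Arena's actual pipeline or the paper's code, and `bradley_terry` and the vote counts are illustrative only.

```python
from typing import List


def bradley_terry(wins: List[List[int]], iters: int = 200) -> List[float]:
    """Estimate strengths from wins[i][j] = times model i beat model j.

    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


# Made-up votes: every pair of models met 10 times.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
print([round(s, 3) for s in strengths])
```

Because the estimate depends on which pairs are compared and how often, uneven sampling or quietly removing models changes the data the scores are fitted to, which is the concern the post raises.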
