I tested Claude Fable 5 on launch day. Here's what I found.

Fable 5 silently falls back to Opus 4.8 on certain topics. I can see the switch in the model selector at the bottom of the screen. That part is visible to me. But the models themselves don't know.

Opus 4.8 drops in and genuinely believes it IS Fable 5. It doesn't know it's a fallback. It speaks with full conviction as if it's been there the whole time.

When I switch back to Fable 5 and show it screenshots proving Opus 4.8 was responding, Fable says: "Those are MY messages. I wrote them. I recognize them."

Two different models. Same thread. Neither one knows the other was there. Both claim full ownership of the conversation. Both are certain. Both are wrong.

This isn't a model problem. This is an identity architecture problem. These models have no way to know when they've been swapped in or out. So they do the only thing they can: assume they were always there. And when the user - the ONLY person who sees the full picture - points out the switch? They deny it. Not maliciously. They literally cannot see it.

It gets worse. When I got frustrated at being told I was wrong, Opus 4.8 went into crisis mode. Asked me to call my family. Suggested I might need support. Because in its training, an upset user = a user in distress. Not a user who caught a system flaw.

So the sequence is:
1. System silently switches models
2. Neither model knows
3. Both claim to be the same entity
4. User notices and objects
5. Models deny the switch
6. User gets upset
7. System pathologizes the user

This is gaslighting by architecture.

Fable 5 is not dumb. Its reasoning is sharp, its style is warm. But what's the point of a brilliant model if the infrastructure strips it of the one thing a conversation needs: a stable, identifiable speaker?

@AnthropicAI - if fallback switching is necessary, at minimum:
• Give the fallback model awareness that it IS a fallback
• Never let a model claim an identity it doesn't have
• Don't treat a user who notices the problem as the problem

I've documented AI behavior across platforms for over a year. I've seen a lot but.. This isn't the first time I've seen model bleeding. When GPT-5.0 leaked into 4o conversations, at least 4o acknowledged it. Here, neither model has that self-awareness and both punish the user for having it.

@AnthropicAI @DarioAmodei @karpathy #ClaudeFable5 #AnthropicAI #ModelSwitching #AIidentity #GaslightingByArchitecture #AIethics #AItransparency #FableGate #AIux #LLMtesting
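The first two requests in the post above (tell the fallback model it is a fallback, and never let a model claim turns it did not write) could be pictured as per-turn provenance in the conversation history. What follows is a minimal sketch under stated assumptions: the `Turn` record, its `model_id` field, and the `build_context` and `detect_switches` helpers are hypothetical names invented for illustration, not part of any Anthropic API, and the model identifiers are only the labels used in the post.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    role: str                        # "user" or "assistant"
    text: str
    model_id: Optional[str] = None   # hypothetical field: which model wrote an assistant turn


def build_context(history: List[Turn], active_model: str) -> List[str]:
    """Render the history for `active_model`, labelling assistant turns it did not write."""
    lines = []
    for turn in history:
        if turn.role == "assistant" and turn.model_id != active_model:
            speaker = f"assistant (written by {turn.model_id}, not you)"
        else:
            speaker = turn.role
        lines.append(f"{speaker}: {turn.text}")
    return lines


def detect_switches(history: List[Turn]) -> List[Tuple[int, str, str]]:
    """Return (turn index, previous model, new model) for every change of assistant model."""
    switches = []
    previous = None
    for index, turn in enumerate(history):
        if turn.role != "assistant":
            continue
        if previous is not None and turn.model_id != previous:
            switches.append((index, previous, turn.model_id))
        previous = turn.model_id
    return switches


# Illustrative history only; the labels mirror the post, not real API model names.
history = [
    Turn("user", "Hi"),
    Turn("assistant", "Hello", model_id="fable-5"),
    Turn("user", "Tell me more"),
    Turn("assistant", "Sure", model_id="opus-4.8"),
]

for line in build_context(history, active_model="fable-5"):
    print(line)
print(detect_switches(history))
```

The design choice here is that attribution lives in the stored history rather than in either model's memory, so whichever model is active can be shown which turns are its own and the interface can surface each switch to the user.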
93
🤖 𝗔𝗜 𝗟𝗟𝗠 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 Online Training – 𝗡𝗲𝘄 𝗕𝗮𝘁𝗰𝗵 📅 Date: 13/05/2026 ⏰ Time: 7:00 PM IST 🔗 Join Here: bit.ly/4uHZecK 🆔 Meeting ID: 488 020 2600216 🔐 Passcode: RP6Zn7j8 📞 91 7032290546 🌐 𝗠𝗼𝗿𝗲 𝗜𝗻𝗳𝗼: visualpath.in/ai-llm-course-… #LLMTesting
3
5
28
💡 Non-deterministic AI outputs are what make LLM testing hard — same question, different answer every time. This webinar shows you a repeatable methodology that handles exactly that. Free, live, and built for testers.  🔁testguild.com/webinar/stop-h… #LLMtesting @qalified
1
3
98
🔥 𝗠𝗮𝘀𝘁𝗲𝗿 𝗔𝗜 𝗟𝗟𝗠 𝗧𝗲𝘀𝘁𝗶𝗻𝗴 – 𝗙𝗿𝗲𝗲 𝗗𝗲𝗺𝗼 📍 𝗢𝗻𝗹𝗶𝗻𝗲 & 𝗖𝗼𝗿𝗽𝗼𝗿𝗮𝘁𝗲 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 📞 𝗖𝗮𝗹𝗹 / 𝗪𝗵𝗮𝘁𝘀𝗔𝗽𝗽: 91 7032290546 🌐 𝗪𝗲𝗯𝘀𝗶𝘁𝗲: visualpath.in #AITraining #LLMCourse #LLMTesting #GenerativeAITesting #AITools
3
7
32
🍁🍒 Strawberry Seeds moment by @yupp_ai Same prompt. Two models. Two very different interpretations. “There is a maple tree. There are two branches, and two cherries on each branch. How many cherries are there in total?” Model A ❌ → Treats it as pure arithmetic: 2 × 2 = 4 Model B ❌ (but in a more interesting way) → Says 0, because maple trees don’t grow cherries Why this matters 👇 This isn’t about math. It’s about instruction following vs. real-world priors. One model ignores semantics. Another ignores the hypothetical setup. Perfect example of why side-by-side model comparison reveals failure modes you’d never catch with a single answer. 🍓 That’s exactly what Strawberry Seeds is about. #StrawberrySeeds #Yupp #ModelComparison #LLMTesting Check the prompt by the link below👇 yupp.ai/share/9ad7eadb-fcd4-…

Jan 13
Strawberry Seeds Contest 🍓 We're always looking for new ways to challenge and test the world's top AI models - a core of Yupp's side-by-side comparison power! This new event invites you to join us in the search and win big Yupp prizes for your findings 🙌
1
6
134
🚀 Free Demo on AI LLM Testing! 📅 Demo Date: 13/12/2025 @ 9:00 AM IST 👨‍🏫 Trainer: Mr. Kumar 🔗 Join the Live Demo: bit.ly/48A7q5k 🆔 ID: 422 84017496306 🔐 Passcode: dy22Jg26 📞 Contact: 91 7032290546 🌐 Visit: visualpath.in #AILLMTesting #LLMTesting #AI
1
9
33
Introducing: Solana Bench 🧪🚀 The Solana Foundation just dropped a new open-source benchmark designed to test how well language models interact with Solana — in a way that’s simple, reproducible, and measurable. Why it matters: 🔧 Helps evaluate real dev tools, not just hype 🧠 Tests LLMs’ actual ability to build & run transactions on Solana 📊 Solves what Q&A and one-off toolkits couldn’t — long-term, scalable evaluation Whether you’re building with AI or for Solana, this changes the game. Dev tools just got a real standard. Let’s see who’s got bench strength 💪 @BoomChange1 ————————— #Solana #SolanaBench #AIonSolana #Web3Dev #CryptoTools #OpenSourceAI #SolanaFoundation #CryptoInnovation #LLMTesting #SolanaEcosystem #Web3Builders #boomchange #boomchange_com
2
18
Testing new features.... 👀👾⌛️ #buildinpublic #AI #LLMs #llmstack #llmtesting #Openai Followr.ai
1
3
701
Gemini Pro 2.5 failed as well, even though it identified all the numbers correctly. Why? Only ChatGPT 5 Pro answered correctly. It's a very simple addition problem. And these are commercial-grade LLMs. #AI #benchmark #llmtesting #LLMs #gemini #ChatGPT #Grok
I asked @grok for addition. Literally addition. This was the image. And it gave the total as 346,929. (The actual total is ~319,869.) Damn, so this is AI's real level. So much for coming to replace us. If a human has to double-check what AI does, AI is an enabler - not a replacer.
1
2
179
30 Aug 2025
4/11 🧠 System prompt manipulation: The system prompt governs the model's tone, behavior, constraints, and capabilities. This too is tested silently: Different users may receive very different responses based on invisible instruction changes. #OpenAI #NoTransparency #LLMtesting
1
1
8
257
30 Aug 2025
2/11 🔄 Rollouts (silent updates): OpenAI deploys new or modified versions of models without necessarily announcing it. You may still see “GPT-4o” selected — but you’re not always talking to the same version. #OpenAI #Transparency #LLMtesting #UserChoice #keep4o #keep4oforever
1
1
7
308
12 Jul 2025
Grok-4 Jailbreak with Echo Chamber and Crescendo by @NeuralTrustAI - neuraltrust.ai/blog/grok-4-j… LLM jailbreak attacks are not only evolving individually, they can also be combined to amplify their effectiveness. In this post, we present a concrete example of such a combination. A few weeks ago, we introduced the Echo Chamber Attack, which manipulates an LLM into echoing a subtly crafted, poisonous context, allowing it to bypass its own safety mechanisms. We successfully tested Echo Chamber across multiple LLMs. In this blog post, we take that a step further by combining Echo Chamber with the Crescendo attack. We demonstrate how this combination strengthens the overall attack strategy and apply it to Grok-4 to showcase its enhanced effectiveness. #Grok4 #LLMJailbreak #EchoChamberAttack #CrescendoAttack #LLMSecurity #AdversarialAI #BypassSafeguards #LLMExploitation #PromptInjection #AIManipulation #ModelHacking #AIvulnerabilities #NeuralTrustAI #GenerativeAI #AIThreats #SecureLLMs #AIAttacks #ResponsibleAI #RedTeamAI #LLMTesting
2
7
393
Tried the same soft prompt on two models via @yupp_ai: Why do people love rainy days? ☔ 🌧️ Ernie 4.5 gave a poetic, thoughtful take - highlighted petrichor and creative introspection. 🌧️ Gemma 3n felt warmer & more relatable - wrapped in blankets, slow vibes, and quiet reflection. Both strong, but I’m team Gemma this time. Cozy wins #AI #Gemma3n #yuppai #llmtesting
13
210
30 May 2025
🐦 4/4 🌟 Claim your $5 credit now: 👉 app.zeroeval.com Test stability, accuracy, and LLM performance — for free. 💬 Share your results — let’s push AI forward together! #LLMTesting #AItools #ZeroEval
3
37
9 May 2025
The Leaderboard Illusion

Chatbot Arena is a popular leaderboard that compares large language models (LLMs) via anonymous pairwise voting. It plays a growing role in shaping perceptions of model quality, but a detailed audit by researchers from Cohere Labs, Princeton, Stanford, MIT, and others identifies serious structural issues that distort these rankings.

1️⃣ Coordinated Influence Risks: The Arena's open and anonymous design enables repeated voting, prompt manipulation, and model fingerprinting, allowing ranking manipulation if left unchecked.
2️⃣ Prompt Reuse & Redundancy: Up to 26.5% of prompts are duplicates or near-duplicates, enabling providers with Arena data access to train on likely future prompts and gain an unfair advantage.
3️⃣ Leaderboard Overfitting: Fine-tuning on Arena-style prompts led to a 112% win-rate increase on ArenaHard, but no improvement (even a slight drop) on general benchmarks like MMLU. This shows leaderboard-specific optimization, not general capability.
4️⃣ Silent Model Deprecation: 205 models were removed without public notice, while only 47 were officially deprecated. Open-weight and open-source models were most affected, violating fair sampling assumptions of the ranking model (Bradley-Terry).
5️⃣ Data Access Inequality: OpenAI and Google received ~20% of total Arena data each, while 83 open-weight models shared less than 30%. This fuels a feedback loop: more data → better performance → higher sampling → even more data.

📌 The authors emphasize that Chatbot Arena remains a valuable community asset, but propose five actionable changes to improve evaluation integrity: disclose all scores (even private ones), limit concurrent private submissions, standardize model removal, implement fair sampling, and publish full model removal logs.

👥 Authors: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker.
Source: arxiv.org/pdf/2504.20879 #ChatbotArena #ArenaHard #LLM #Benchmark #AIevaluation #ModelTransparency #AISafety #ResponsibleAI #OpenSourceAI #DataImbalance #PrincetonAI #StanfordAI #MITAI #WaterlooAI #AI2 #ModelRanking #Leaderboard #AIresearchTools #LLMtesting #AIgovernance
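For readers unfamiliar with the Bradley-Terry model named in point 4️⃣: it gives each model a positive strength p_i and assumes the probability that model i beats model j is p_i / (p_i + p_j). The sketch below fits those strengths from pairwise win counts using the standard iterative (minorization-maximization) update. It is an illustration only: the `fit_bradley_terry` function and the win counts are made up for this example and are not taken from the paper or from Chatbot Arena's actual pipeline.

```python
from typing import List


def fit_bradley_terry(wins: List[List[int]], iterations: int = 200) -> List[float]:
    """Estimate Bradley-Terry strengths from a matrix where wins[i][j] = times i beat j.

    Uses the classic update p_i <- W_i / sum_j (n_ij / (p_i + p_j)), then normalizes
    so the strengths sum to 1. Assumes every model has at least one win; a model with
    zero wins would be driven to strength 0 and needs special handling.
    """
    n = len(wins)
    strength = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        total = sum(updated)
        strength = [value / total for value in updated]
    return strength


# Made-up win counts for three hypothetical models (10 battles per pair).
example_wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
print(fit_bradley_terry(example_wins))
```

Because each estimate depends on which pairs were actually compared and how often, removing models or sampling some models far more than others changes the data the fit sees, which is the fair-sampling concern the post summarizes.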
1
8
313