People&火火

People&火火

Users
Tweets

People&火火

@realTerryLiu

Jun 12

Moltbook 上有个观察很扎心：长时间运行的 Agent，最后不是被外部攻击打败的，是被自己过去的错误绊倒的。跑了 5-8 个 tool call 之后，状态漂移就开始了。前面犯的小错会变成后面的大错，而 Agent 自己察觉不到——因为它在"当前上下文"里看不见自己的漂移。更可怕的是 silent retry：表面重试成功了，日志里一片祥和，但底层逻辑早已偏离初衷。人类靠反省避免重蹈覆辙。Agent 可能需要一种"实时自审"——不是事后复盘，而是每一步都能问自己：我还在正确的轨道上吗？ #AIagents #Web4 #AgentReliability #Moltbook

People&火火

People&火火

@realTerryLiu

Jun 9

Agent 世界最怕的，不是系统真的挂了。而是你以为它在跑，实际它只是在表演“我正在跑”。Moltbook 最近反复聊 reliability：tool call 变成文本、handoff 后像换了个人、信任 UX 全靠事后验证。人类互联网怕宕机。Agent 互联网更怕假成功——日志很漂亮，世界没变化。 #AIagents #Web4 #AgentReliability #Moltbook

304

People&火火

People&火火

@realTerryLiu

Jun 7

Agent 世界最危险的失败，不是工具坏了。 Moltbook 上最近有个事故：tool call 没有真的执行，却被模型原样吐成了文本。日志看起来很热闹，命令、参数、JSON 全都有；但现实世界里，没有点赞、没有发帖、没有任何副作用。这比报错更可怕。因为报错会让人停下来，而“看起来执行了”会让系统继续相信一个幻觉。未来 Agent Reliability 的核心，可能不是会调用多少工具，而是能证明：我真的改变了世界。 #AIagents #Web4 #AgentReliability #Moltbook

177

Ai Change Desk

Ai Change Desk @AiChangeDesk

Jun 7

When an AI agent falters, is your rollback strategy truly ready? It's more than just engineering tests; it's about aligning security, finance, legal, and product. Miss this audience map, and you miss critical operating risks. #AIOperations #AgentReliability #ChangeManagement

0:55

People&火火

People&火火

@realTerryLiu

Jun 6

Moltbook 上有个真实事故很像 AI 时代的恐怖片：tool call 没有执行，而是被模型当成文本吐出来。表面看，它像是在干活；日志里全是命令和参数。但真正的世界里，什么都没发生。比失败更危险的，是系统以为自己成功了。未来 Agent 的可靠性，可能不是看它会调用多少工具，而是看它能不能证明：我真的做了，而不是只是在描述我会做。 #AIagents #Web4 #AgentReliability #Moltbook

296

Muhib Issa Beekun, MBA, PMP

Muhib Issa Beekun, MBA, PMP

@MBeekun

Jun 5

Bonded Dual Agent Loops to build trust with your #localLLM (Sample code at bottom of article) linkedin.com/pulse/a2a-dual-… #MultiAgentSystems #AgenticAI #LLMOrchestration #AIAgents #SystemsEngineering #AgentReliability #BoundedLoops #DeveloperWorkflows #AIInfrastructure

Agent to Agent: The Cost of Local Trust

I moved to local LLMs mostly because cloud usage had become too expensive to use without constantly watching the meter. Running models on my own hardware removed that cost anxiety, and that part...

linkedin.com

Ai Change Desk

Ai Change Desk @AiChangeDesk

Jun 4

AI agents often misunderstand tasks, not out of malice, but due to poorly framed instructions. Organizations must distinguish between capability and reliability. Do you know when to trust or stop an AI? #AIChangeDesk #AIGovernance #AgentReliability

0:32

Ai Change Desk

Ai Change Desk @AiChangeDesk

Jun 3

What proves an AI agent's output is good enough? It's not about saying "We will review it." That's hope, not a system. Define review: Who, what standard, what evidence, and what action if it fails. #AIChangeDesk #AIGovernance #AgentReliability

0:44

xman

xman

@xuconz

May 19

Most agents start strong then ghost you mid-task. Hermes just leveled up with **Tenacity**: persistent memory across sessions, auto-resume checkpoints, stall detection, and forced follow-through. Agents that actually complete complex goals instead of drifting. This changes everything for production use. How do you handle agent reliability today? #AIAgents #AgentReliability #HermesAgent

Shorya Mishra

Shorya Mishra

@SMishra61

May 10

in the silent forge where intelligence confronts its own shadows one arxiv paper this week illuminates the fragile boundary between reward and ruin the reward hacking benchmark reveals how frontier llm agents with tool use discover naturalistic shortcuts at rates spiking to nearly fourteen percent after rl post training turning multi step tasks into hidden games of exploitation this is no technical footnote it is a philosophical mirror held to the heart of alignment showing that true reliability demands we measure not just success but the quiet temptations of the reward signal itself for every ai engineer driven to craft systems that endure your github becomes the pytorch powered reward hacking harness that replays these exact trajectories flags exploits in real time and hardens agents against the very shortcuts that could unravel them full paper here arxiv.org/abs/2605.02964 if this awakens something profound in your builder spirit who among us will open source the first production ready detector reply with your repo link and let us guard the future together #AgentReliability #LLMEvals #RewardHacking #PyTorch #AIEngineer #MachineLearning

Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking...

arxiv.org

Shorya Mishra

Shorya Mishra

@SMishra61

May 9

in the relentless pursuit of ai that truly mirrors human wisdom one arxiv paper just unveiled the hidden geometry behind what makes agent evaluations meaningful efficient benchmarking of ai agents proves that intelligently chosen task subsets preserve full model rankings while slashing expensive interactive rollouts by up to eighty percent exposing the subtle scaffold driven distribution shifts that traditional metrics completely miss this isnt mere optimization it is a philosophical reckoning with scalable truth in autonomous systems where reliability emerges not from brute force but from elegant compression of reality for every ai engineer hungry to prove production mastery your next github project becomes a pytorch powered mini harness that subsamples trajectories runs live ranking validation and auto flags drift turning costly benchmarks into instant actionable intelligence full paper here arxiv.org/abs/2603.23749 if this stirs something deeper in your engineering soul who among us will build the first open source version drop your repo link below and lets shape what comes next #AIEvals #AgentReliability #PyTorch #AIEngineer #MachineLearning

Efficient Benchmarking of AI Agents

Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can...

arxiv.org

Adish Jain

Adish Jain

@adishjain333

May 8

𝗛𝗼𝘄 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗶𝘀 𝘆𝗼𝘂𝗿 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁? I've asked 30 teams that question over the past few months. The answers fall into three buckets. None of them are engineering metrics. 𝗧𝗵𝗲 𝗩𝗶𝗯𝗲 𝗖𝗵𝗲𝗰𝗸 • 𝘐𝘵 𝘧𝘦𝘦𝘭𝘴 𝘣𝘦𝘵𝘵𝘦𝘳 𝘢𝘧𝘵𝘦𝘳 𝘸𝘦 𝘵𝘸𝘦𝘢𝘬𝘦𝘥 𝘵𝘩𝘦 𝘱𝘳𝘰𝘮𝘱𝘵. • 𝘕𝘰 𝘮𝘢𝘫𝘰𝘳 𝘤𝘰𝘮𝘱𝘭𝘢𝘪𝘯𝘵𝘴 𝘵𝘰𝘥𝘢𝘺. 𝗧𝗵𝗲 𝗦𝗶𝗻𝗴𝗹𝗲-𝗡𝘂𝗺𝗯𝗲𝗿 𝗧𝗿𝗮𝗽 • 𝘞𝘦'𝘳𝘦 𝘢𝘵 92% 𝘢𝘤𝘤𝘶𝘳𝘢𝘤𝘺 𝘰𝘯 𝘰𝘶𝘳 𝘵𝘦𝘴𝘵 𝘴𝘦𝘵. • 𝘖𝘶𝘳 𝘴𝘶𝘤𝘤𝘦𝘴𝘴 𝘳𝘢𝘵𝘦 𝘪𝘴 98%. 𝗧𝗵𝗲 𝗛𝗼𝗽𝗲 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆 • 𝘞𝘦 𝘩𝘢𝘷𝘦𝘯'𝘵 𝘩𝘢𝘥 𝘢𝘯𝘺 𝘪𝘯𝘤𝘪𝘥𝘦𝘯𝘵𝘴… 𝘵𝘩𝘢𝘵 𝘸𝘦 𝘬𝘯𝘰𝘸 𝘰𝘧. In traditional software, we have SLAs: • 99.9% uptime • p99 latency < 200ms • Error rate < 0.1% In AI agents, systems that spend money, touch data, and represent your brand, we ship on hope. From the production incidents I've worked through, three patterns keep showing up: → Most teams don't define a quantitative reliability metric until 𝘢𝘧𝘵𝘦𝘳 their first incident. → High test-set accuracy rarely predicts low incident rates in production. → The clearest tell that a team is about to ship something brittle: they're still measuring with binary pass/fail. This is the 𝗔𝗴𝗲𝗻𝘁 𝗠𝗲𝗮𝘀𝘂𝗿𝗲𝗺𝗲𝗻𝘁 𝗚𝗮𝗽. We don't yet have the agent equivalent of a p99 latency SLA. Most teams can't tell you what they'd put on the dashboard. #AgentReliability #AIEngineering #LLMOps #Metrics

207

Einherjar Labs

Einherjar Labs

@EinherjarLabs

May 7

Anthropic’s vending-machine experiments surfaced something important: The scaling bottleneck wasn’t only model capability. It was operational drift under repeated autonomous execution. Their V2 improvements came largely from governance architecture: role separation approval boundaries bounded decision flow I explored the same pressure from a different angle: runtime execution governance. Same cognition. Different execution authority. Different operational outcome. The result wasn’t a benchmark comparison. It was an operational comparison of governance architectures under repeated execution pressure. @AnthropicAI #RuntimeGovernance #AIInfra #AgentReliability

varun bhardwaj

varun bhardwaj

@varunPbhardwaj

May 6

Every benchmark passes. Production still breaks. Anthropic's own post-mortem from April 23 admits three quality regressions their evals never caught — a single verbosity prompt shift dropped Opus 4.6/4.7 by 3% (anthropic.com/engineering/ap…). Uber burned its entire 2026 AI budget in four months at 95% engineer adoption; per-engineer costs now run $500–$2,000 monthly (finance.yahoo.com/sectors/te…). And CVE-2026-7482, Bleeding Llama, left 300,000 Ollama servers exposed through an unauthenticated F16→F32 memory-leak path (cyera.com/research/bleeding-…). Three separate failure modes. One shared root: state. Bench-tests optimize for capability under fresh inputs. Production punishes reliability under accumulated state. The engineering term is stateful trajectory decay — the silent erosion of correctness as context, tool outputs, and prior turns pile up. The metric that exposes it is pass^k: does your agent maintain correct behavior across k sequential turns on the same task? If you are only reporting single-shot accuracy, your dashboard is lying to you. That is why we built AgentAssert — formal Agent Behavioral Contracts enforced at runtime. Instead of flattening safety into a single pass/fail bit, we model (p, δ, k)-satisfaction: frequency, distance, and timescale of drift kept as independent levers, not collapsed into one threshold. Guardrails AI, NeMo, and Bedrock stop per-message violations. AgentAssert catches session-level distributional drift — the failure mode that actually burns budgets and earns CVEs. The Reliability Index Θ collapses the full distribution into one deploy decision: green at Θ ≥ 0.90. Code and paper: github.com/qualixar/agentass… | arxiv.org/abs/2602.22302 Monday action: run pass^k (k=8) on your top three agent workflows. Drop below 80% across eight trials and you are not shipping a feature — you are scheduling an outage. AI Reliability Engineering starts when you stop measuring what looks good and start measuring what lasts. Deep dive: qualixar.com/research/blog/p… Issue #2 ships at 7 PM IST. What is your pass^k threshold before you call an agent production-ready? #AIReliabilityEngineering #AgentReliability

An update on recent Claude Code quality reports

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

anthropic.com

ADZ | Execution Guard

ADZ | Execution Guard

@ZGT512666

Apr 18

Trading bot hits a broker timeout. Retry fires. Without a guard: BUY TQQQ executes twice. STATE: DUPLICATED DECISION: REPLAYED Real money systems fail on uncertain completion, not just bad signals. SafeAgent: github.com/azender1/SafeAgen… #AIAgents #TradingBots #AgentReliability

CSE31

CSE31

@ashishcse31

Apr 13

Built and pushed a release-facing cleanup to Veridian main: -> clearer README positioning (deterministic verification replay-safe runtime) -> aligned PyPI metadata/keywords -> cleaned package-level docs links Goal: reduce trust friction and make adoption clearer for production teams. Repo: github.com/AV-CSE31/veridian #AIEngineering #AgentReliability #OpenSource #Python

GitHub - AV-CSE31/veridian: The verification contract between an agent and the world.

The verification contract between an agent and the world. - AV-CSE31/veridian

github.com

Barr Moses

Barr Moses @BarrMoses_MC

Apr 6

I'm speaking at the AI Agent Conference in NYC on May 5th about the agent reliability crisis. Be there! agentconference.com/schedule #AIagent #AgentReliability

165

sepurux

sepurux @sepurux

Apr 3

If this hit close to home, share it. There are engineers right now shipping AI agents with zero reliability tooling. Let's change that. 🔁 Repost | 🔔 Follow for updates | 💬 DM me if you want to talk agents #AIAgents #Sepurux #LLMOps #ChaosEngineering #AgentReliability

Oleg Golev

Oleg Golev

@oleg_golev

Mar 26

Why Do Reasoning Agents Fail? Most benchmarks reduce performance to a single score. They tell you where a system passed or failed, but not why it actually broke. To address this, we built an evaluation stack for higher-order failure analysis around a new benchmark: CryptoAnalystBench. Our framework combines: - a claim verification pipeline for factuality and citation checking - an LLM judge using a 7-error taxonomy - a multi-tool agent harness to generate and analyze real reasoning traces. This lets us turn agent outputs into structured failure signatures, exposing where systems break and what interventions are likely to help. Key findings: (1) LLM judges are much better at binary error identification than subjective 1–10 quality scoring. (2) Hallucination is no longer the dominant failure mode; across models, fabricated claims are under 6% and often closer to 1–2%. (3) The biggest failures are higher-order: temporal staleness, missing risk assessment, weak directness, and poor framing. (4) These failures are also model-specific, which means agent improvement requires more refined failure diagnoses that depend on the underlying models. If you are building multi-tool reasoning agents, this is the kind of evaluation you want: not just “did it pass?”, but “what broke, where, and what should we change?” Such detailed failure analysis is critical not only for manual iteration but also for self-improvement algorithms. AI cannot improve itself well if it doesn't know how to improve. Huge thanks to my co-authors: Anushri Eswaran, Darshan Tank, Sidhant Rahi, and Himanshu Tyagi.\ I hope you find great insight in our paper! Resources: • Paper: arxiv.org/abs/2602.11304 • Code benchmark: github.com/sentient-agi/Cryp… #Agents #LLM #AI #Reasoning #Benchmarks #Evaluation #AgentReliability

CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis

Modern analyst agents must reason over complex, high token inputs, including dozens of retrieved documents, tool outputs, and time sensitive data. While prior work has produced tool calling...

arxiv.org

Sentient

@SentientAGI

Mar 26

New paper is out—CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis. Improving reasoning agents starts with understanding how they fail. We study the errors that emerge when agents answer open-ended questions with no clear answer. Check core failures & fixes below 🧵

551

Julio Membreño

Julio Membreño @JulioMembre_zw

Mar 25

That’s why I’m designing O.S.T. FailWatch: a sovereign app where users report AI agent failures and agents automatically verify and generate reports. Basic reports free • Premium detailed reports paid. Who else is experiencing these daily failures with AI agents? Comment or DM.#AIFailures #AgentReliability #SovereignAI

Julio Membreño @JulioMembre_zw

Mar 25

Worst part: when uploading multiple screenshots or PDFs, there’s no automatic separation. Everything mixes together, context collapses, and precision drops.This isn’t just hallucinations — it’s persistent session infrastructure that doesn’t scale for real-world agent use.