AI is transforming much of tech, but DevOps and SRE are tough nuts to crack. Security, privacy, and the fear of an AI accidentally taking down production have been huge roadblocks until now.
AOI (Autonomous Operations Intelligence) is a trainable multi-agent framework designed to make automated cloud operations secure, private, and capable of self-improvement.
Why is this a big deal for the future of AIOps?
• Strict Read/Write Separation: Current LLM agents dangerously mix "read" and "write" permissions. AOI fixes this by splitting the workload among an Observer (read-only diagnosis), a Probe, and an Executor (which only handles gated write actions). This architectural separation ensures safe learning and prevents unauthorized system mutations.
• Local & Private: SRE environments are full of sensitive data that cannot be sent to closed-source frontier APIs. AOI solves this by using GRPO (Group Relative Policy Optimization) to distill expert-level operational knowledge into a highly capable 14B open-source model. You get expert reasoning without exposing proprietary data.
• Learning From Failure: Usually, when a closed AI system fails a task, that data is useless. AOI introduces a "Failure Trajectory Closed-Loop Evolver" that mines unsuccessful diagnostic runs and converts them into corrective supervision signals. The system literally learns from its own mistakes to continually refine its performance.
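To make the read/write split concrete, here is a minimal Python sketch. It is not AOI's actual code: the class layout, the `READ_ONLY_VERBS` allowlist, and the `approve` callback are all hypothetical names made up for illustration, and the Probe is left out because the post does not say what it does. The idea is only that the diagnosing agent cannot issue a mutating command at all, while every write goes through a gate and leaves an audit entry.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical allowlist of non-mutating command verbs (an assumption, not from AOI).
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


class Observer:
    """Read-only diagnosis: refuses any command whose verb is not on the allowlist."""

    def propose(self, command: str) -> str:
        parts = command.split()
        if not parts or parts[0] not in READ_ONLY_VERBS:
            raise PermissionError(f"Observer may not run: {command!r}")
        return command


@dataclass
class Executor:
    """Write path: each mutation must pass an approval gate and is audit-logged."""

    approve: Callable[[str], bool]  # hypothetical gate, e.g. policy check or human sign-off
    audit_log: List[str] = field(default_factory=list)

    def apply(self, command: str) -> bool:
        allowed = self.approve(command)
        self.audit_log.append(f"{'ALLOW' if allowed else 'DENY'} {command}")
        return allowed  # a real system would run the command only when allowed
```

With this shape, `Observer().propose("delete pod web-1")` raises `PermissionError`, and an `Executor` built with a restrictive `approve` function records a `DENY` line instead of touching the system.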
On the AIOpsLab benchmark, the base AOI runtime achieved a 66.3% best@5 success rate, outperforming the prior state-of-the-art by 24.4 percentage points. Furthermore, the locally deployed 14B model surpassed Claude Sonnet 4.5 on unseen faults, and the Evolver reduced run-to-run variance by 35%.
This is a massive step forward in building truly autonomous SRE agents that actually respect enterprise permissions and boundaries.
AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis
AOI is proposed to address the main barriers to using LLM agents for Site Reliability Engineering (SRE) in real enterprises: no access to sensitive operational data, strict permission and safety rules that block risky actions, and closed systems that cannot learn from their own failures.
The core solution is AOI (Autonomous Operations Intelligence), a secure, trainable multi-agent framework that turns automated cloud operations into a trajectory learning problem. Its three key innovations are:
- a trainable diagnostic system using GRPO to distill expert knowledge into a small open-source 14B model without exposing proprietary data
- a read-write separated architecture that splits agents into Observer (read-only diagnosis), Probe, and Executor (gated write actions) for safety and auditability
- a Failure Trajectory Closed-Loop Evolver that mines failed runs and converts them into corrective training signals for continuous self-improvement in a closed environment.
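For readers new to GRPO, the "group relative" part can be sketched in a few lines of Python. This shows only the standard advantage normalization from GRPO in general: each sampled trajectory's reward is compared against the mean and standard deviation of its own group. The binary success reward in the example is an assumption for illustration; the summary above does not say how AOI scores a diagnostic run, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: one scalar per trajectory sampled for the same task.
    eps:     small constant so a group with identical rewards yields zeros
             instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four diagnostic runs on one fault, reward 1.0 = correct diagnosis.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful runs end up with positive advantages and failed runs with negative ones, so the policy update pushes toward the behaviors that beat the group average. No separate value model is needed, which is one reason the approach suits a small locally trained model.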
On the AIOpsLab benchmark with 86 real-world SRE tasks, base AOI hits 66.3% best@5 success (beating prior state-of-the-art by 24.4 points). The GRPO-trained Observer achieves 42.9% avg@1 on unseen faults, outperforming Claude Sonnet 4.5, while the Evolver adds 4.8 points end-to-end and cuts variance by 35%. AOI shows how to build truly autonomous, self-improving SRE agents that stay secure and respect permissions.
Full research paper:
arxiv.org/abs/2603.03378