🚨 Zhejiang University just open-sourced a complete framework for training AI agents that control your phone by looking at the screen.
Not through APIs. Not through special integrations. Through taps, swipes, and keystrokes — exactly like a human would.
It's called ClawGUI. And it works on Android, HarmonyOS, and iOS out of the box.
Here's why this matters.
Most AI agents you've seen demoed (the ones booking flights, ordering food, sending emails) work through APIs and programmatic access. The app has to be integrated. The developer has to build the connection. Anything without an API is off-limits.
Most apps don't expose a public API. Most software on your phone was never designed for AI control. The long tail of applications (the obscure tools, the enterprise software, the legacy apps) is out of reach for API-driven agents.
GUI agents fix this by doing what humans do. They look at the screen. They read the interface. They decide where to tap. No API required. Any app that works for a human works for a GUI agent.
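To make the look, read, tap loop concrete, here's a minimal sketch in Python. It is not ClawGUI's API: `FakePhone`, `decide`, and `run_episode` are hypothetical names, the "phone" is a two-screen toy, and a hand-written rule over a pre-parsed element list stands in for the model that would read raw screenshots on a real device.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One visible UI element (hypothetical, pre-parsed from the screen)."""
    label: str
    x: int
    y: int
    value: str = ""


@dataclass
class Action:
    """What the agent sends back: 'tap', 'type', or 'done'."""
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""


class FakePhone:
    """Toy stand-in for a device: a home screen and a search screen."""

    def __init__(self):
        self.screen = "home"
        self.typed = ""

    def observe(self):
        # A real agent would receive screenshot pixels here.
        if self.screen == "home":
            return [Element("Search", 100, 50)]
        return [Element("Text field", 100, 50, self.typed)]

    def apply(self, action):
        if action.kind == "tap" and self.screen == "home":
            self.screen = "search"
        elif action.kind == "type":
            self.typed += action.text


def decide(goal_text, elements):
    """Rule-based placeholder for the learned policy."""
    by_label = {e.label: e for e in elements}
    if "Search" in by_label:
        e = by_label["Search"]
        return Action("tap", e.x, e.y)
    field = by_label.get("Text field")
    if field is not None and field.value != goal_text:
        return Action("type", field.x, field.y, goal_text)
    return Action("done")


def run_episode(phone, goal_text, max_steps=5):
    """Observe, decide, act, and repeat until the policy says 'done'."""
    history = []
    for _ in range(max_steps):
        action = decide(goal_text, phone.observe())
        history.append(action)
        if action.kind == "done":
            break
        phone.apply(action)
    return history


phone = FakePhone()
actions = run_episode(phone, "weather")
print([a.kind for a in actions])
```

Nothing in the loop knows which app it's driving. It only sees what's on screen and emits taps and keystrokes, which is why no per-app integration is needed.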
ClawGUI is the first complete open-source infrastructure to build, train, evaluate, and deploy these agents — all in one framework.
Here's what it actually includes:
→ ClawGUI-RL: First open-source GUI agent reinforcement learning infrastructure with support for both virtual environments and real physical devices simultaneously
→ ClawGUI-Eval: Standardized evaluation pipeline across 6 benchmarks and 11 models — 95.8% reproduction against official baselines, meaning you can actually trust the numbers
→ ClawGUI-Agent: Deploys trained agents to real devices through 12 chat platforms — chat with your phone agent through WhatsApp, Telegram, Slack, or whatever you already use
Here's the wildest part.
They trained ClawGUI-2B, a 2-billion-parameter model, entirely within this pipeline. On MobileWorld GUI-Only, it achieves a 17.1% success rate, beating the same-scale baseline by 6 full percentage points.
A 2B model. Controlling a phone. Trained end-to-end in an open-source pipeline anyone can reproduce.
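For scale, here's a quick back-of-the-envelope check on that gain. It assumes the "6 points" figure is exact, so the implied baseline and relative improvement are approximations, not reported numbers.

```python
# Figures quoted in the post
success_rate = 17.1   # ClawGUI-2B on MobileWorld GUI-Only, in percent
gain_points = 6.0     # absolute gain over the same-scale baseline

# Derived values (assumption: the 6-point gain is not rounded)
implied_baseline = success_rate - gain_points
relative_gain = gain_points / implied_baseline

print(f"implied baseline: {implied_baseline:.1f}%")
print(f"relative improvement: {relative_gain:.0%}")
```

Under that assumption, the baseline sits around 11%, so the absolute 6-point gain works out to roughly a 50% relative improvement.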
Here's why the infrastructure matters more than the benchmark.
The reason GUI agents haven't taken off despite years of research isn't capability — it's fragmentation. Training pipelines are closed. Evaluation metrics drift between papers so you can't compare results. Trained models never reach real devices.
Every team building GUI agents has been rebuilding the same infrastructure from scratch. ClawGUI removes that bottleneck entirely. Train, evaluate, and deploy to a real phone from a single open-source framework.
No closed pipelines. No proprietary training infrastructure. No results you can't reproduce.
100% Open Source. Model available on Hugging Face now.
GitHub link in the comments 👇