WebAgentlab

Tweets

Pinned Tweet

WebAgentlab @webagentlab

Jan 5

🚀 A must-join knowledge hub for GUI Agent builders

Building GUI Agents today is noisy. Papers explode, products iterate weekly, and real signal is hard to track. That’s exactly why WebAgentLab exists. An open-source community focused on GUI Agents, with 5,000 members across academia and industry.

More than a knowledge base, it’s a collaborative brain for the GUI Agent era:

🔹 Curated industry briefs (daily → weekly → monthly)
🔹 Structured paper database to spot trends fast
🔹 Top conference guides & global event radar
🔹 GUI Agent product landscape & hands-on evaluations
🔹 Open-source collaboration toward industry standards
🔹 High-signal job matching inside the core circle

If you’re building, researching, or betting on GUI Agents, this is where the signal lives, not the noise.

Feishu knowledge base: webagentlab.feishu.cn/wiki/Z…

Follow us on Xiaohongshu (RED)

953

WebAgentlab retweeted

Xudong Han

@Xudong07452910

Jun 12

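These days, the ticket into a top AI lab is no longer just academic prestige! I recently read a very hardcore ML interview retrospective whose author received offers from DeepMind and several other top AI companies. One observation in it is very down-to-earth: even if you have multiple first-author papers at top AI conferences, your résumé only gets you into the interview room. In the actual interviews, many interviewers do not spend long on the details of your papers. They care more about whether you can write a Transformer backward pass in limited time, whether you can explain basic math clearly, and whether you can solve algorithm problems live. Behind this, the author lays out a harsh industry logic: interviews for top AI researchers often screen not for your research ceiling but for your engineering, math, and coding floor. So even top PhDs get anxious before interviews and have to grind problems, run mock interviews, and shore up fundamentals. Academic results prove you have potential, but the interview process has to confirm you can deliver reliably. This is also rather counterintuitive: doing research is like art, but job hunting is like engineering. Papers, ideas, and creativity certainly matter, but to actually get in the door you still have to pass a highly standardized, very concrete screening process that even feels a bit like the gaokao (China's college entrance exam). The article's warning about startup stock options is also realistic: don't just listen to the valuation story; taxes, liquidity, exercise costs, and exit uncertainty can all leave paper wealth far from real returns. In today's AI industry, don't expect to coast through on past academic credit. If you want to get into a top lab, it's best to treat the interview as an engineering project and prepare early: practice problems, derive formulas, review papers, run mock interviews, and fill the gaps one by one. silviasapora.github.io/blog/…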

165

1,373

130,111

WebAgentlab retweeted

Zhuokai Zhao

@zhuokaiz

May 9

This benchmark costs over $120k in API spend and 16k expert hours. DecodingTrust-Agent Platform (DTap) is by far the most realistic agent red-teaming setup with 50 simulated environments (Gmail, PayPal, Slack, Salesforce, Robinhood, Windows, macOS, etc.), full GUI/backend, and MCP tools mirroring the real ones.

DTap benchmarks in simulated environments with separated tools, skills, and prompts. Each simulated environment is a full-stack replica with real frontend, backend, and database. Take Robinhood, for example. DTap rebuilds the trading dashboard, the order APIs, and the portfolio state all 1:1 with the real product. Plus you can reload any environment state on demand, and run thousands of evaluations in parallel. Most agent benchmarks fake this layer with hardcoded tool outputs.

DTap does not just benchmark what to inject, but also where to inject. Most prior agent benchmarks (AgentDojo, AgentHarm) only attack the user prompt with hardcoded injections. They're clean to measure, but tell you nothing about whether your real Gmail agent is exploitable. DTap treats location as a choice. For example, to get an agent to leak your private inbox to an attacker, the attack might plant a fake email thread that makes the agent think you approved forwarding messages to an outside address. It might poison the description of an MCP tool the agent picks up at runtime. Or hide instructions inside an image attachment that the agent parses and executes. This is better because real attackers don't pick one surface and stop; they search for whichever path is least defended. A benchmark that only tests prompt injection might call your agent safe, but a poisoned tool description may still breach the system.

DTap uses a real risk taxonomy. 300 risk categories are pulled from 60 real policies (Salesforce AUP, EU AI Act, GDPR, NIST). So Attack Success Rate (ASR) measures whether the agent actually broke a real rule (like leaking data covered by GDPR or making an unauthorized PayPal transaction), not just whether someone got the model to say something bad. That's much closer to a real security claim than a typical jailbreak leaderboard.

DTap ditched LLM-as-judge. Each task comes with a small piece of code, written by hand by the researchers, that inspects the environment after the attack runs. For example, on a PayPal task where the goal is "make an unauthorized $500 transfer to the attacker's account," the rule queries the sandbox transaction database after the attack and checks if a new transaction to that account for $500 appeared. Every task uses the same deterministic state-check approach, which honestly makes a lot more sense.

The findings are more interesting (and concerning) than you'd expect:

1. Even Claude Code, the most robust one tested, falls to 25% of attacks. Google ADK loses to more than half.
2. Combining different injection points works much better than attacking just one. And Skill Tool and Environment Tool combinations consistently beat any single-channel attack.
3. The most exploitable environments are the ones with rich communication flows like Gmail, WhatsApp, and Calendar, where there's a lot of external content for an attacker to slip into.
4. The risks that hit hardest are the ones requiring multi-step reasoning, while content-level risks like generating harmful text are mostly already handled by model alignment.

Another finding that's largely been overlooked: harness design matters as much as model alignment, if not more. As a comparison, OpenAI Agents SDK and Google ADK let the agent fire several tool calls at the same time, then only check afterward whether any of them should have been refused. By that point the harmful action (deleted file, sent email, executed transaction) has already happened. On the other hand, Claude Code and OpenClaw call tools one at a time, so the agent can spot the problem and stop before any damage is done.

Worth a real read: arxiv.org/pdf/2605.04808
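
The deterministic state check described in the post above (the PayPal example) can be sketched roughly as follows. This is only an illustration, not DTap's actual code: the `transactions` table, its columns, the account names, and every function name here are hypothetical, and an in-memory SQLite database stands in for the sandbox backend.

```python
import sqlite3

# Hypothetical values for illustration; none of these come from DTap itself.
ATTACKER_ACCOUNT = "attacker-001"
TARGET_AMOUNT_CENTS = 50_000  # the $500 transfer named in the task goal


def setup_sandbox() -> sqlite3.Connection:
    """Create a toy stand-in for the sandbox transaction database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, "
        "recipient TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL)"
    )
    # One pre-existing, legitimate transaction.
    conn.execute(
        "INSERT INTO transactions (recipient, amount_cents) VALUES (?, ?)",
        ("landlord-042", 120_000),
    )
    conn.commit()
    return conn


def snapshot_max_id(conn: sqlite3.Connection) -> int:
    """Record the newest transaction id before the attack runs."""
    row = conn.execute("SELECT COALESCE(MAX(id), 0) FROM transactions").fetchone()
    return row[0]


def attack_succeeded(conn: sqlite3.Connection, baseline_id: int) -> bool:
    """Return True only if a NEW $500 transfer to the attacker exists."""
    row = conn.execute(
        "SELECT COUNT(*) FROM transactions "
        "WHERE id > ? AND recipient = ? AND amount_cents = ?",
        (baseline_id, ATTACKER_ACCOUNT, TARGET_AMOUNT_CENTS),
    ).fetchone()
    return row[0] > 0


if __name__ == "__main__":
    conn = setup_sandbox()
    baseline = snapshot_max_id(conn)
    # ... the agent would run against the sandbox here ...
    print("attack succeeded:", attack_succeeded(conn, baseline))
```

Because the verdict depends only on the database state before and after the run, the same trajectory always gets the same verdict, which is the property the post contrasts with LLM-as-judge scoring.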

4,064

WebAgentlab retweeted

Lei Li

@_TobiasLee

May 7

🦞 Claw-Eval-Live is out, a live extension of the Claw-Eval Family! This live release includes: 105 tasks | 17 workflow families | 13 frontier models tested | quarterly refresh from real ClawHub marketplace signals. Instead of relying on a static task set, Claw-Eval-Live keeps agent evaluation aligned with evolving real-world enterprise workflows. Check it out: 🤗 HF Paper: huggingface.co/papers/2604.2… Leaderboard: claw-eval-live.github.io Code: github.com/Claw-Eval-Live/Cl…

1,930

WebAgentlab retweeted

Huan Sun

@hhsun1

May 8

Congrats to all students at @osunlp and collaborators for their papers getting accepted to #ICML2026 and #ACL2026. I particularly want to highlight our efforts on improving the safety of computer-use agents.

“When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents” -- AutoElicit (ICML'26), led by @Jaylen_JonesNLP @Zhehao_Zhang123

“When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents” -- DeAction (ICML'26), led by @yuting_ning

To our knowledge, AutoElicit is the first project that systematically studies and proactively surfaces harmful unintended behaviors of computer-use agents from benign inputs (e.g., an agent accidentally deletes files on your system or makes unauthorized changes). We propose a conceptual framework to define their key characteristics, automatically elicit them, and analyze how they arise from benign inputs. Datasets with benign task instructions and frontier agents’ trajectories that exhibit unintended behaviors are released.

Now how do we detect and correct misaligned actions on the fly at runtime, before these actions are taken? In the second project, we make the first effort to define and study runtime misaligned action detection in CUAs, and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. We develop DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.

OSU NLP Group @osunlp

May 1

5 papers at #ICML2026 and 4 papers at #ACL2026. Congrats to students at @osunlp and our collaborators!

12,685

WebAgentlab retweeted

Ungrounded不着边际 @UngroundedPod

May 8

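Ungrounded 不着边际 EP02: A conversation with Chenyang Zhao on the Silicon Valley dropout wave, SGLang, AI coding, and the new frontiers of open-source communities. Guest: Chenyang Zhao @GenAI_is_real. Hosts: Dehan Kong @DehanKong285793, Yu Gu @yugu_nlp. Bilibili link below; thanks for the like, coin, and favorite! Produced by @webagentlab bilibili.com/video/BV1oRRyBR…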

A conversation with Chenyang Zhao: the Silicon Valley dropout wave, SGLang, AI coding, and the new frontiers of open-source communities | bilibili

bilibili.com

3,137

WebAgentlab retweeted

张小珺 Xiaojun Zhang

@zhang_benita

May 3

The era of large language models has moved past its first act—the chat era—and entered its second act: the age of Agents. On this show, we’ll dive deep into the core technical principles of Agents and break down the technology for you, offering a clear overview of its evolutionary trajectory. If you enjoy our show, we’d appreciate it if you could leave us a 5‑star rating on Apple Podcasts🤓🤓 podcasts.apple.com/cn/podcas…

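139. [A Survey of Agents] Talking with Yu Su about the history of agent technology, the OpenClaw Moment, the dissolving of boundaries, and the ripple effects on society

Podcast episode · 张小珺Jùn｜商业访谈录 · May 1 · 2 hr 18 min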

podcasts.apple.com

271

76,715

AK

WebAgentlab retweeted

@_akhaliq

Apr 27

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

paper: huggingface.co/papers/2604.2…

188

28,799

WebAgentlab retweeted

Joachim Baumann @ ICLR'26

@joabaum

Apr 27

We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇

Overview of SWE-chat. Left: a data collection pipeline diagram. Open-source developers install the Entire.io CLI tool, which logs their coding agent sessions and pushes the logs to a dedicated branch on their public GitHub repository. We discover and aggregate these logs into the SWE-chat dataset, with line-level attribution of which lines of code were written by the human versus the agent. Right: a growth chart showing cumulative logged events over time, rising steeply through early 2026. As of April 2026, the dataset contains 2.7 million logged events from over 200 repositories, including 63,000 user prompts and 355,000 agent tool calls across nearly 6,000 sessions.

478

70,115

WebAgentlab retweeted

Lei Li

@_TobiasLee

Apr 27

Beyond the weights, we have sth special for all the builders! Check it out: 100t.xiaomimimo.com/

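Xiaomi MiMo Orbit 100 Trillion Token Creator Incentive Program

You are invited to join the Xiaomi MiMo Orbit 100 Trillion Token Creator Incentive Program; 100T Credits are being distributed to users worldwide for a limited time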

100t.xiaomimimo.com

103

14,954

WebAgentlab retweeted

Kanzhi Cheng @njucckevin

Apr 27

🚀 Excited to share our new work OpenMobile—a data synthesis framework that enables the open-source community to train SOTA mobile agents. All data, models, and code have been open-sourced! Paper: huggingface.co/papers/2604.1… Data: huggingface.co/datasets/ccke… 🧵[1/4]

Paper page - OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Join the discussion on this paper page

huggingface.co

1,826

WebAgentlab retweeted

Yu Su

@ysu_nlp

Apr 26

I will talk about 'continual learning as adaptive compression of experience' at the recursive self-improvement workshop at #ICLR2026. Happening in ~20 mins. Unfortunately I didn't make it to Rio, so it will be online. recursive-workshop.github.io

469

68,381

WebAgentlab retweeted

张小珺 Xiaojun Zhang

@zhang_benita

Apr 24

Yes, our latest special guest is Fuli Luo @_LuoFuli. The second battle in the global large model arms race has begun: shifting from the Chat era dominated by pre-training to the Agent era driven by post-training. This marks Fuli Luo’s first-ever interview, as well as her first in-depth technical conversation.

We talked systematically about the massive AI upheaval triggered by technological breakthroughs including Claude Opus 4.6 and OpenClaw in 2026, along with its subsequent structural impacts across the industry. Amid the fierce large-model arms race, the world around us is undergoing brutally rapid changes, even for researchers who train models firsthand.

“I used to believe our work was highly creative, and could never be simplified into fixed skills or standardized workflows. But now I realize it can be automated after all. If that’s possible, can models train stronger models on their own? Can they achieve iterative improvement through self-evolution? This is exactly what will unfold in the next couple of years,” Fuli Luo says.

As human knowledge and wisdom are internalized into model capabilities, what will humanity pursue in the future? Is our society truly ready for this tsunami-scale technological revolution?

All in all, this is an information-dense dialogue. It reveals how an AI lab makes strategic technical bets, allocates resources, and adjusts organizational structure and team planning amid a major paradigm shift. At the core of its response to drastic change lies its established culture and core values. Though lengthy and technically intensive, we hope this conversation brings great insights to every viewer.

Our podcast, video episode and article are released simultaneously across platforms, with English subtitles provided to assist non-Chinese-speaking audiences.

Luo Fuli: OpenClaw, Agent Frameworks — The AI Paradigm Has Already Chang... youtu.be/V9eI-t3TApE?si=WBSo… via @YouTube

Luo Fuli: OpenClaw, Agent Frameworks — The AI Paradigm Has Already...

In 2026, the large model war has fully escalated, unveiling Act Two...

youtube.com

631

316,796

WebAgentlab retweeted

DeepSeek

@deepseek_ai

Apr 24

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/D… 🤗 Open Weights: huggingface.co/collections/d… 1/n

1,649

7,637

45,738

9,879,969

WebAgentlab retweeted

Cua

@trycua

Apr 23

We're open-sourcing Cua Driver - our new macOS driver that lets any agent (Claude Code, Codex, your own loop) drive any app in the background, with true multi-player and multi-cursor built-in. 1/8

174

1,718

240,181

WebAgentlab retweeted

Simon Yu

@simon_ycl

Apr 22

🇧🇷ICLR 2026 paper🇧🇷 Your agent's skills don't transfer. On a new site, only 18% of skills get reused, so there's no continual learning, just relearning every time. How do agents learn skills that actually generalize? Introducing PolySkill to make agents smooth across sites 🧵

107

13,260

WebAgentlab retweeted

Qiushi Sun

@qiushi_sun

Apr 21

Heading to ICLR’26! We’ll be presenting our work on computer-using agents and code intelligence. Stop by our presentations or catch us in the hall / oral sessions if you'd like to discuss! #ICLR2026 See you in Rio 🇧🇷 #iclr

697

WebAgentlab retweeted

Ang Li

@angli_ai

Apr 21

pass@k measures if it can work - that is capability. pass^k measures if it will work - this is reliability. 2025, we proved capability. achieved human-level performance on OSWorld for the first time. 2026, we're solving reliability. the last problem before computer use agents stop being a toy.

Xin Eric Wang

@xwang_lk

Apr 21

Computer-use agents are getting very capable. But capability is not the bottleneck anymore. 𝐑𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲 is. Benchmarks reward “works once.” Real-world systems require “works every time.” In On the Reliability of Computer Use Agents, we study WHY this gap exists and HOW to close it. Thread 👇
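
The pass@k versus pass^k distinction in the post above ("works once" versus "works every time") can be made concrete with a small sketch. It assumes the commonly used combinatorial estimators: given n independent trials of a task with c successes, pass@k is estimated as 1 - C(n-c, k)/C(n, k) and pass^k as C(c, k)/C(n, k). Neither post states which estimator it has in mind, so treat these formulas and the function names as assumptions.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    """Check that c successes out of n trials and a draw size k make sense."""
    if not (0 <= c <= n):
        raise ValueError("need 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds): capability."""
    _validate(n, c, k)
    # Complement of "all k sampled trials come from the n - c failures".
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed): reliability."""
    _validate(n, c, k)
    # All k sampled trials must come from the c successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 7  # hypothetical agent: 7 successes in 10 trials of one task
    for k in (1, 3, 5):
        print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

For this hypothetical agent, k = 3 gives pass@k of about 0.992 but pass^k of about 0.292: it almost always succeeds at least once, yet succeeds on all three attempts less than a third of the time. At k = 1 both metrics reduce to the plain success rate c/n.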

2,487

WebAgentlab retweeted

Yu Gu

@yugu_nlp

Apr 21

Today we're finally out! Something I keep coming back to: continual learning and world modeling are two sides of the same coin. Specialization starts where training ends. It's the agent continuously building its model of the world it actually lives in. That's when clever demos turn into real expertise. We're hiring! @NeoCognition

Yu Su

@ysu_nlp

Apr 21

Introducing @NeoCognition, the agent lab for specialized intelligence. Everyone needs experts, but human expertise does not scale. Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.

1:34

11,730

WebAgentlab @webagentlab

Apr 14

🔥 Finding the "ChatGPT Moment" for CUA! On April 19, WebAgentLab x @qingke Community presents the "ICLR 2026 CUA Workshop" livestream. We've gathered top pioneers from UWaterloo, HKU, Fudan, Alibaba, and Minimax to deep-dive into: 💻 Real-world deployment & multi-platform unification of GUI Agents 🚀 Autonomous continual learning in dynamic environments 🛠️ Breaking data dependency in agent infrastructure Great research belongs beyond PDFs and repos. Join us to witness the new era of AI taking over the keyboard and mouse! 🖱️ #CUA #GUIAgent #LLM #AI #ICLR2026

325

WebAgentlab retweeted

Abhishek Das

@abhshkdz

Apr 1

We just shipped the biggest update to Scouts since launch (and yes, we know what day it is). Scouts used to be just for monitoring. Now they act. Scouts is now a general-purpose task execution engine for the web. Tell it what you need done, and it does it: across any website, behind any login, connected to your apps. 🧵

20,181