Teaching AI agents to use computers. Continuous evaluation and infrastructure platform for computer-use agents.

Joined March 2025
Paradigm Shift AI retweeted
Over the past month I have been open sourcing most of what we built @ParadigmShiftAI . Today I am officially sharing that I have shut the company down and fully wound down operations. I am very proud of what we built in a short time: a real product for large scale computer use evaluations and infra to run agents and benchmarks at scale. This was not an easy decision. @ParadigmShiftAI was accepted into a top startup accelerator and had strong interest from parts of the agent ecosystem, but the risk, market timing, and where I have the deepest edge were not aligned with how I want to spend the next several years. I am grateful to everyone who gave feedback, tried the product, or took a bet on us. All of our work is now MIT open source for anyone to use or build on, and I will share the full list of repos and the dataset in the comments. Going forward my energy is moving back to a domain I know deeply. I am now building a new fintech x AI startup in stealth, better aligned with my prior experience as a fixed income portfolio manager before tech. If you are working in agents, evals, or fintech and want to jam on ideas or hear more about what I am building next, my DMs are open.
Paradigm Shift AI retweeted
🚀 Excited to announce the release of our computer-use dataset! 3,100 high-quality computer & browser tasks including videos, application logs, DOMs, screenshots, metadata - now open source on @huggingface. Perfect for RL or training AI agents that can use computers like humans 🤖💻 🔗 huggingface.co/datasets/anai… 🧵 Thread with details ⬇️
Paradigm Shift AI retweeted
🚀 Open sourcing @ParadigmShiftAI's computer-use agent evaluation infrastructure! 2 repos now available: 🔬 Neurosim - Core eval framework 🤖 Agent-CE - 4 pre-integrated agents (Browser Use, Notte, Anthropic CUA, OpenAI CUA) (links in comments ⬇️) This infrastructure handles everything from agent execution to automated evaluation, making it easy to: ✅ Run reproducible agent benchmarks in isolated containers ✅ Compare agent performance across episodes ✅ Deploy on GCP Cloud Run at scale ✅ Get automated LLM-based scoring and feedback Production-tested infrastructure, same that we use with our clients & now available for everyone. 📜 MIT licensed. Huge thanks to the team who built this: Ashwin Thinnappan @sjashwin_ Vaibhav Gupta Jameel Shahid Mohammed @shahid_0324T Maithili Hebbar #AI #Agents #OpenSource #Evals #ComputerUse #WebAgent 🧵 1/3
Paradigm Shift AI retweeted
🚀 Open-sourcing our LLM-as-a-Judge for computer-use agents! Built at @ParadigmShiftAI, this system evaluates web browsing agents using visual grounding (screenshots and trajectories) to provide: ✅ 0-100 scoring with detailed reasoning ✅ 20 specific error categories ✅ Actionable improvement tips ✅ Support for GPT-5, GPT-5-mini, and Gemini As agentic systems move to production, reliable automated evaluation is critical. This judge provides the grounded feedback needed to actually improve agent performance at scale. MIT licensed and ready to use: github.com/anaishowland/llm-… Includes examples, Docker support, and full documentation. Excited to see what the community builds with it! #OpenSource #AI #AgenticAI #ComputerUse #LLM
Paradigm Shift AI retweeted
🚀 Open-sourced my dataset creation toolkit for web agent benchmarks ✨ Features: • Multi-LLM support (OpenAI, Claude, Gemini) • Auto website analysis • Task generation and categorization • Production-ready, MIT licensed Built @ParadigmShiftAI, now available for researchers or AI engineers looking to create benchmarks, build training datasets, or evaluate web agents. github.com/anaishowland/data…
Paradigm Shift AI retweeted
🚀 Open-sourcing Captr, a screen and interaction recorder purpose-built for creating datasets, training and evaluating computer-use agents, and RL. 🎥 It captures screen video, mouse/keyboard events, DOM snapshots, accessibility trees, and metadata. macOS: github.com/anaishowland/Capt… Windows: github.com/anaishowland/Capt…
Paradigm Shift AI retweeted
🏁 WebVoyager Leaderboard for @browser_use Agent 📊 🚀 All evals were run end-to-end on @ParadigmShiftAI infra with the same configs across models. 1 episode per task. We removed 69 impossible WebVoyager tasks, leaving 574 total. 🔸 GPT-5 stays on top 🥇, cheaper 💸 than other premium models but 35% slower ⏱️ than runner-up Claude 🔸 Claude 4 Sonnet is reliably strong 🥈 but the priciest 💰, 44% higher than Gemini 🔸 Gemini 2.5 Pro often overclaims as our judge flags more false positives vs others. Fastest premium model 💨, 17% quicker than Claude 🔸 Gemini 2.5 Flash is blazing fast 💨 and cheap 💸 but less reliable Sites causing issues: ⚠️ Cambridge Dictionary: Claude 19%, Gemini Flash 19%, Gemini Pro 53%, GPT-5 56%. CAPTCHA is the blocker, Claude and Flash can’t get past it ⚠️ Google Search: Gemini Flash 7%, Claude 51%, Gemini Pro 56%. Mostly CAPTCHA. GPT-5 hits 79% and doesn’t seem affected ⚠️ Booking: GPT-5 54%, Gemini Flash 54%, Gemini Pro 62%. Claude shines at 85% ✅ Top performers: GitHub and arXiv, near-perfect across all providers 👉 Let me know which model, agent or benchmark you want to see next! 📩 DM me for the full results
Paradigm Shift AI retweeted
🏁 Online-Mind2Web Model Leaderboard for @browser_use Agent 📊 🚀 All evals were run end-to-end on @ParadigmShiftAI infra with the same configs across all models. 2 to 5 episodes per model, averaged metrics, no best-of-N cherry picking. TL;DR: the model you choose to run a web agent matters a lot and it changes by workflow and by site. There is no one-size-fits-all. GPT-5 🔸 Best performance, 79% 🥇 🔸 Weak on shopping tasks: 43% overall (37% on Macy’s website!) 🔸 Near perfect on health sites like Healthline, Drugs.com, Mayo Clinic 🔸 Slow ⏱️ Claude 4 Sonnet 🔸 Close second on performance, 77% 🥈 🔸 Strongest shopper: 62% overall (73% on Macy’s website) 🔸 Most expensive 💸 Gemini 2.5 Pro 🔸 Solid on simpler flows 🔸 Struggles on shopping at 45% (37% success rate on Macy’s website too) and form filling at 46% Gemini 2.5 Flash 🔸 Fast ⏱️ 🔸 Don’t use for shopping or form filling workflows, performance in the low 30s on those workflows There is no single winner. Route by workflow and by site. Pay for reliability when the task is high interaction. Use faster cheaper models for straightforward flows. 📩 DM me if you want to see the full evaluation results or let me know which model or agent you want to see next. #AIagents #WebAgents #LLM #Evaluation #Mind2Web #Benchmarking #ParadigmShiftAI
Paradigm Shift AI retweeted
Computer-use agents are advancing fast! Reliability is the gate. At @ParadigmShiftAI we add the missing layer: continuous evals and observability on sandboxed infra, with thousands of task runs and CI hooks to help teams ship dependable agents 🚀
28 Aug 2025
Computer use is the next step towards true agentic coworkers. Models that can click, type, and reason across the existing software humans use will work like magic. Computer-using agents will actually provide end-to-end automation across legacy and modern tools alike: navigating UIs, logging in, and sending files. The agents that win will slot in where a human can today, without IT overhauls or custom integrations. Excellent deep dive from @zephratic, @stuffyokodraws, @seema_amble, and @JenniferHli:
Paradigm Shift AI retweeted
We tested @browser_use on 2 benchmarks across different models on @ParadigmShiftAI: a random sample of 100 tasks and a set of 100 interaction-heavy tasks (multiple episodes). Also compared against Anthropic CUA and OpenAI CUA (accessed via API). The charts clearly show: 🔹GPT-5 and Claude lead on reliability but GPT-5 is about 2x slower than others (not worth it IMO) 🔹Gemini 2.5 Flash/Pro deliver the best speed and price on easier flows 🔹On the hardest interaction-heavy tasks, Claude 4 Sonnet performs the best Maybe routing by task difficulty is the way to go: use a premium model for high-interaction workflows where reliability is non-negotiable, and use Gemini Flash or a smaller model for straightforward tasks where speed and cost matter. All evals ran end-to-end on the @ParadigmShiftAI platform & infra. Want to test your agent with different models & benchmarks? DM me. Full results in the thread ⬇️
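The routing idea in the post above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual logic: the `Task` class, the `interaction_steps` difficulty proxy, the threshold of 10, and the model labels are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass

# Placeholder labels for illustration only; these are not real API model IDs.
PREMIUM_MODEL = "premium-model"   # e.g. a model that is strong on interaction-heavy tasks
FAST_MODEL = "fast-cheap-model"   # e.g. a faster, cheaper model for simple flows


@dataclass(frozen=True)
class Task:
    name: str
    interaction_steps: int  # hypothetical proxy for how interaction-heavy a task is


def route(task: Task, threshold: int = 10) -> str:
    """Pick a model tier from an assumed difficulty proxy.

    Tasks at or above the threshold go to the premium model,
    everything else goes to the fast, cheap model.
    """
    if task.interaction_steps >= threshold:
        return PREMIUM_MODEL
    return FAST_MODEL


checkout = Task(name="multi-step checkout", interaction_steps=25)
lookup = Task(name="simple lookup", interaction_steps=3)
print(route(checkout), route(lookup))
```

In practice the difficulty signal would have to come from somewhere (task metadata, past failure rates, or the site involved); the single integer used here only stands in for that signal.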
Paradigm Shift AI retweeted
Benchmark hacking is real! An agent can hit “90%” performance on any benchmark by cherry-picking results across runs. In the real world you only get a few tries at a task. Let me show you in action: I ran ~1k tasks x 10 episodes with @browser_use on Gemini 2.5 Flash. Per-episode mean was ~70%, but if you take the best score per task across all 10 episodes it jumps to ~90%. Looks great on a slide, but not realistic. With @ParadigmShiftAI, you can measure what actually matters: reliability across episodes (and a lot more). Same tasks, same infra, same rules. No gaming. Full eval results and methodology in the comments. If you ship agents, this will change how you read “benchmark SOTA”.
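The gap between the two numbers in the post above comes from how per-task results are aggregated. A minimal sketch, using a made-up toy grid of pass/fail outcomes (not the actual eval data) and hypothetical function names:

```python
def per_episode_mean(results):
    """Success rate averaged over every (task, episode) pair."""
    total_runs = sum(len(episodes) for episodes in results)
    total_successes = sum(sum(episodes) for episodes in results)
    return total_successes / total_runs


def best_of_n(results):
    """Fraction of tasks solved in at least one episode (the cherry-picked score)."""
    solved = sum(1 for episodes in results if any(episodes))
    return solved / len(results)


# Toy data: 4 tasks x 3 episodes, 1 = success, 0 = failure. Purely illustrative.
toy = [
    [1, 1, 1],  # always solved
    [1, 0, 1],  # mostly solved
    [0, 1, 0],  # solved once
    [0, 0, 0],  # never solved
]

print(per_episode_mean(toy))  # 6 successes out of 12 runs
print(best_of_n(toy))         # 3 of 4 tasks solved at least once
```

A task that succeeds once in many tries counts fully toward the best-of-N score but only fractionally toward the per-episode mean, which is why the first number can look much better than what a user sees on a single attempt.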
Paradigm Shift AI just supercharged web-agent evals 🚀 We revamped our analytics with deeper agent insights, success heatmaps, variance scores, human baselines, full replay & crash logs and more. See where your agent shines or stumbles all in one place. Want access to the platform to test your agent? DM us. Blog ➜ paradigm-shift.ai/blog/agent…
Paradigm Shift AI retweeted
Ran @browser_use on @ParadigmShiftAI to pit Claude 4 Sonnet vs Gemini 2.5 Pro on 10x10 WebVoyager vision tasks. Claude: 99 % accuracy & 3× faster ⚡️ Gemini: 75 % accuracy 😬 @GoogleDeepMind why the lag? #AI #VisionAI
Track browser-eval progress in real time, episode by episode and right from your dashboard! No more hunting through live logs (unless you still get a kick out of it 😅)
More news & insights to share soon 🔥
Ran a web-agent evaluation on 5k tasks in one pass with @ParadigmShiftAI, our biggest batch yet! Planning to 2x capacity each week and aiming for a 100K-task eval in a few weeks. Stay tuned, more insights coming! 🔥
Paradigm Shift AI retweeted
Totally agree, great analysis. That’s why @ParadigmShiftAI delivers richer metrics, deeper failure-trace analytics, and a bigger task bank (proprietary and public) to really stress-test web agents
Existing AI Agent benchmarks are broken 🤖💔 Great work by @maxYuxuanZhu and @ddkang to identify and fix issues and establish rigorous best practices for Agentic AI benchmarks! Check out the blog: ddkang.substack.com/p/ai-age…
Thrilled to announce we've been accepted into the @UofBeta Pre-Acceleration Program Cohort 10! Looking forward to connecting, learning, and growing alongside other incredible founders.
Introducing NeuroSim, our browser agent evaluation platform! Run real-world evaluations for browser agents and models, see gap-to-human scores, share team leaderboards. Free while we iterate with you. Read more 👉 paradigm-shift.ai/blog/neuro… DM or email info@paradigm-shift.ai for access. #AI #LLM #Evals

o3 just got 80% cheaper (thanks @OpenAI), so we added it. NeuroSim supports o3, run your browser-use agent evals on Paradigm Shift AI and see how they stack up!
10 Jun 2025
we dropped the price of o3 by 80%!! excited to see what people will do with it now. think you'll also be happy with o3-pro pricing for the performance :)