Anaïs Howland

Tweets

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

17 Nov 2025

Over the past month I have been open sourcing most of what we built at @ParadigmShiftAI. Today I am officially sharing that I have shut the company down and fully wound down operations. I am very proud of what we built in a short time: a real product for large scale computer use evaluations and infra to run agents and benchmarks at scale. This was not an easy decision. @ParadigmShiftAI was accepted into a top startup accelerator and had strong interest from parts of the agent ecosystem, but the risk, market timing, and where I have the deepest edge were not aligned with how I want to spend the next several years. I am grateful to everyone who gave feedback, tried the product, or took a bet on us. All of our work is now MIT open source for anyone to use or build on, and I will share the full list of repos and the dataset in the comments. Going forward my energy is moving back to a domain I know deeply. I am now building a new fintech x AI startup in stealth, better aligned with my prior experience as a fixed income portfolio manager before tech. If you are working in agents, evals, or fintech and want to jam on ideas or hear more about what I am building next, my DMs are open.

156

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

30 Oct 2025

🚀 Excited to announce the release of our computer-use dataset! 3,100 high-quality computer & browser tasks including videos, application logs, DOMs, screenshots, and metadata - now open source on @huggingface. Perfect for RL or training AI agents that can use computers like humans 🤖💻 🔗 huggingface.co/datasets/anai… 🧵 Thread with details ⬇️

182

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

29 Oct 2025

🚀 Open sourcing @ParadigmShiftAI's computer-use agent evaluation infrastructure! 2 repos now available: 🔬 Neurosim - Core eval framework 🤖 Agent-CE - 4 pre-integrated agents (Browser Use, Notte, Anthropic CUA, OpenAI CUA) (links in comments ⬇️) This infrastructure handles everything from agent execution to automated evaluation, making it easy to: ✅ Run reproducible agent benchmarks in isolated containers ✅ Compare agent performance across episodes ✅ Deploy on GCP Cloud Run at scale ✅ Get automated LLM-based scoring and feedback Production-tested infrastructure, the same we use with our clients, now available to everyone. 📜 MIT licensed. Huge thanks to the team who built this: Ashwin Thinnappan @sjashwin_ Vaibhav Gupta Jameel Shahid Mohammed @shahid_0324T Maithili Hebbar #AI #Agents #OpenSource #Evals #ComputerUse #WebAgent 🧵 1/3

168

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

28 Oct 2025

🚀 Open-sourcing our LLM-as-a-Judge for computer-use agents! Built at @ParadigmShiftAI, this system evaluates web browsing agents using visual grounding (screenshots and trajectories) to provide: ✅ 0-100 scoring with detailed reasoning ✅ 20 specific error categories ✅ Actionable improvement tips ✅ Support for GPT-5, GPT-5-mini, and Gemini As agentic systems move to production, reliable automated evaluation is critical. This judge provides the grounded feedback needed to actually improve agent performance at scale. MIT licensed and ready to use: github.com/anaishowland/llm-… Includes examples, Docker support, and full documentation. Excited to see what the community builds with it! #OpenSource #AI #AgenticAI #ComputerUse #LLM
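
To make the output described in the post concrete (a 0-100 score, reasoning, error categories, improvement tips), here is a minimal sketch of how one judged episode could be held and aggregated. Everything in it is an assumption for illustration: the class name `JudgeResult`, its fields, the `mean_score` helper, and the category label `captcha_blocked` are hypothetical and are not taken from the open-sourced repo.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeResult:
    """Hypothetical container for one judged episode (not the repo's schema)."""
    task_id: str
    score: int          # 0-100, the range described in the post
    reasoning: str
    error_categories: list[str] = field(default_factory=list)  # placeholder labels
    tips: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0-100 range.
        if not 0 <= self.score <= 100:
            raise ValueError(f"score must be in 0..100, got {self.score}")


def mean_score(results: list[JudgeResult]) -> float:
    """Average judge score across a list of episodes."""
    return sum(r.score for r in results) / len(results)


# Made-up example episodes, for illustration only.
example = [
    JudgeResult("task-1", 90, "Reached the goal page.", tips=["Fewer redundant clicks."]),
    JudgeResult("task-2", 40, "Stopped at a CAPTCHA.", error_categories=["captcha_blocked"]),
]
```

Validating the score range at construction time means a malformed judge response fails early instead of skewing an average later.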

321

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

15 Oct 2025

🚀 Open-sourced my dataset creation toolkit for web agent benchmarks ✨ Features: • Multi-LLM support (OpenAI, Claude, Gemini) • Auto website analysis • Task generation and categorization • Production-ready, MIT licensed Built at @ParadigmShiftAI, now available for researchers and AI engineers looking to create benchmarks or training datasets, or to evaluate web agents. github.com/anaishowland/data…

193

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

14 Oct 2025

🚀 Open-sourcing Captr, a screen and interaction recorder purpose-built for creating datasets, training and evaluating computer-use agents, and RL. 🎥 It captures screen video, mouse/keyboard events, DOM snapshots, accessibility trees, and metadata. macOS: github.com/anaishowland/Capt… Windows: github.com/anaishowland/Capt…

GitHub - anaishowland/Captr_MacOS: Screen recording and computer interaction capture tool that...

Screen recording and computer interaction capture tool that records keyboard/mouse input, screen video, DOM snapshots, and accessibility trees. Perfect for creating datasets to train and evaluate c...

github.com

148

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

23 Sep 2025

🏁 WebVoyager Leaderboard for @browser_use Agent 📊 🚀 All evals were run end-to-end on @ParadigmShiftAI infra with the same configs across models. 1 episode per task. We removed 69 impossible WebVoyager tasks, leaving 574 total. 🔸 GPT-5 stays on top 🥇, cheaper 💸 than other premium models but 35% slower ⏱️ than runner-up Claude 🔸 Claude 4 Sonnet is reliably strong 🥈 but the priciest 💰, 44% higher than Gemini 🔸 Gemini 2.5 Pro often overclaims as our judge flags more false positives vs others. Fastest premium model 💨, 17% quicker than Claude 🔸 Gemini 2.5 Flash is blazing fast 💨 and cheap 💸 but less reliable Sites causing issues: ⚠️ Cambridge Dictionary: Claude 19%, Gemini Flash 19%, Gemini Pro 53%, GPT-5 56%. CAPTCHA is the blocker, Claude and Flash can’t get past it ⚠️ Google Search: Gemini Flash 7%, Claude 51%, Gemini Pro 56%. Mostly CAPTCHA. GPT-5 hits 79% and doesn’t seem affected ⚠️ Booking: GPT-5 54%, Gemini Flash 54%, Gemini Pro 62%. Claude shines at 85% ✅ Top performers: GitHub and arXiv, near-perfect across all providers 👉 Let me know which model, agent or benchmark you want to see next! 📩 DM me for the full results

Bar chart titled ‘WebVoyager model leaderboard – ranked by LLM evaluator performance, Sep 20, 2025.’ Y-axis: LLM evaluator score. Four bars: GPT-5 78, Claude 4 Sonnet 77, Gemini 2.5 Pro 73, Gemini 2.5 Flash 67. GPT-5 leads by 1 point over Claude; Flash trails. Purple bars with model logos above each.

Table comparing Browser Use agent across models with rank, performance, steps, time, tokens, and price per task.
1. GPT-5: 82% self-reported, 78% LLM evaluator, 9.4 steps, 4.7 min, 55,226 input tokens, 9,233 output tokens, $0.16.
2. Claude 4 Sonnet: 80%, 77%, 11.2 steps, 3.5 min, 57,143 input, 5,923 output, $0.26.
3. Gemini 2.5 Pro: 82%, 73%, 13 steps, 3.0 min, 82,578 input, 7,491 output, $0.18.
4. Gemini 2.5 Flash: 70%, 67%, 16.3 steps, 2.7 min, 116,551 input, 9,059 output, $0.06.
Takeaway: GPT-5 has the top score, Claude is close but most expensive, Flash is fastest and cheapest with lower reliability
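
The relative figures quoted in the post can be re-derived from the per-task table above. The sketch below uses only numbers from that table; the dictionary layout and the helper name `pct_more` are illustrative choices, not part of any Paradigm Shift AI tooling.

```python
# Per-task figures copied from the table above.
# Fields: self-reported %, LLM-evaluator %, minutes per task, USD per task.
results = {
    "GPT-5":            {"self": 82, "judge": 78, "minutes": 4.7, "usd": 0.16},
    "Claude 4 Sonnet":  {"self": 80, "judge": 77, "minutes": 3.5, "usd": 0.26},
    "Gemini 2.5 Pro":   {"self": 82, "judge": 73, "minutes": 3.0, "usd": 0.18},
    "Gemini 2.5 Flash": {"self": 70, "judge": 67, "minutes": 2.7, "usd": 0.06},
}


def pct_more(a: float, b: float) -> float:
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


# GPT-5 time per task relative to Claude 4 Sonnet.
gpt5_slower = pct_more(results["GPT-5"]["minutes"], results["Claude 4 Sonnet"]["minutes"])

# Claude 4 Sonnet price per task relative to Gemini 2.5 Pro.
claude_pricier = pct_more(results["Claude 4 Sonnet"]["usd"], results["Gemini 2.5 Pro"]["usd"])

# Claude 4 Sonnet time per task relative to Gemini 2.5 Pro.
claude_vs_pro_time = pct_more(results["Claude 4 Sonnet"]["minutes"], results["Gemini 2.5 Pro"]["minutes"])

# Gap between what the agent claims and what the judge accepts, in points.
overclaim = {model: r["self"] - r["judge"] for model, r in results.items()}
```

With the rounded table values, GPT-5 comes out about 34% slower than Claude (the post says 35%), Claude about 44% pricier than Gemini 2.5 Pro, and Claude about 17% slower than Gemini 2.5 Pro. The largest self-reported versus judged gap is Gemini 2.5 Pro at 9 points, which matches the overclaiming remark in the post.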

ALT Table comparing Browser Use agent across models with rank, performance, steps, time, tokens, and price per task. 1. GPT-5: 82% self-reported, 78% LLM evaluator, 9.4 steps, 4.7 min, 55,226 input tokens, 9,233 output tokens, $0.16. 2. Claude 4 Sonnet: 80%, 77%, 11.2 steps, 3.5 min, 57,143 input, 5,923 output, $0.26. 3. Gemini 2.5 Pro: 82%, 73%, 13 steps, 3.0 min, 82,578 input, 7,491 output, $0.18. 4. Gemini 2.5 Flash: 70%, 67%, 16.3 steps, 2.7 min, 116,551 input, 9,059 output, $0.06. Takeaway: GPT-5 has the top score, Claude is close but most expensive, Flash is fastest and cheapest with lower reliability

583

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

16 Sep 2025

🏁 Online-Mind2Web Model Leaderboard for @browser_use Agent 📊 🚀 All evals were run end-to-end on @ParadigmShiftAI infra with the same configs across all models. 2 to 5 episodes per model, averaged metrics, no best-of-N cherry picking. TL;DR: the model you choose to run a web agent matters a lot and it changes by workflow and by site. There is no one-size-fits-all. GPT-5 🔸 Best performance, 79% 🥇 🔸 Weak on shopping tasks: 43% overall (37% on Macy’s website!) 🔸 Near perfect on health sites like Healthline, Drugs.com, Mayo Clinic 🔸 Slow ⏱️ Claude 4 Sonnet 🔸 Close second on performance, 77% 🥈 🔸 Strongest shopper: 62% overall (73% on Macy’s website) 🔸 Most expensive 💸 Gemini 2.5 Pro 🔸 Solid on simpler flows 🔸 Struggles on shopping at 45% (37% success rate on Macy’s website too) and form filling at 46% Gemini 2.5 Flash 🔸 Fast ⏱️ 🔸 Don’t use for shopping or form filling workflows, performance in the low 30s on those workflows There is no single winner. Route by workflow and by site. Pay for reliability when the task is high interaction. Use faster cheaper models for straightforward flows. 📩 DM me if you want to see the full evaluation results or let me know which model or agent you want to see next. #AIagents #WebAgents #LLM #Evaluation #Mind2Web #Benchmarking #ParadigmShiftAI

Vertical bar chart titled Online-Mind2Web - Model Leaderboard. X-axis shows models GPT-5, Claude 4 Sonnet, Gemini 2.5 Pro, Gemini 2.5 Flash. Y-axis shows score. Bars from highest to lowest: GPT-5 79, Claude 4 Sonnet 77, Gemini 2.5 Pro 70, Gemini 2.5 Flash 59. Subtitle says ranked by score high to low. Runs used the Browser Use agent, 2 to 5 episodes per task, on Paradigm Shift AI infrastructure.

ALT Vertical bar chart titled Online-Mind2Web - Model Leaderboard. X-axis shows models GPT-5, Claude 4 Sonnet, Gemini 2.5 Pro, Gemini 2.5 Flash. Y-axis shows score. Bars from highest to lowest: GPT-5 79, Claude 4 Sonnet 77, Gemini 2.5 Pro 70, Gemini 2.5 Flash 59. Subtitle says ranked by score high to low. Runs used the Browser Use agent, 2 to 5 episodes per task, on Paradigm Shift AI infrastructure.

0, 4 Gemini 2.5 Flash 59. Remaining rows list other evaluated models with their scores. All runs used the Browser Use agent, averaged over 2 to 5 episodes per task, on Paradigm Shift AI infrastructure.
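
The "route by workflow" advice in the post can be sketched as a simple lookup. This is a minimal illustration, not the author's routing logic: the names `OVERALL`, `BY_WORKFLOW`, and `pick_model` are hypothetical, and only scores stated exactly in the post are included (Gemini 2.5 Flash's "low 30s" on shopping is left out because no exact figure is given).

```python
# Overall Online-Mind2Web scores quoted in the post (percent).
OVERALL = {
    "GPT-5": 79,
    "Claude 4 Sonnet": 77,
    "Gemini 2.5 Pro": 70,
    "Gemini 2.5 Flash": 59,
}

# Per-workflow scores, limited to figures the post states exactly.
BY_WORKFLOW = {
    "shopping": {"GPT-5": 43, "Claude 4 Sonnet": 62, "Gemini 2.5 Pro": 45},
}


def pick_model(workflow: str) -> str:
    """Return the highest-scoring model for a workflow.

    Workflows without quoted per-workflow scores fall back to the
    overall leaderboard numbers.
    """
    scores = BY_WORKFLOW.get(workflow, OVERALL)
    return max(scores, key=scores.get)
```

Here `pick_model("shopping")` selects Claude 4 Sonnet, while a workflow with no quoted breakdown falls back to GPT-5, the overall leader. A real router would presumably also weigh cost and latency, which this sketch ignores.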

ALT 0, 4 Gemini 2.5 Flash 59. Remaining rows list other evaluated models with their scores. All runs used the Browser Use agent, averaged over 2 to 5 episodes per task, on Paradigm Shift AI infrastructure.

205

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

16 Sep 2025

Computer-use agents are advancing fast! Reliability is the gate. At @ParadigmShiftAI we add the missing layer: continuous evals and observability on sandboxed infra, with thousands of task runs and CI hooks to help teams ship dependable agents 🚀

a16z

@a16z

28 Aug 2025

Computer use is the next step towards true agentic coworkers. Models that can click, type, and reason across the existing software humans use will work like magic. Computer-using agents will actually provide end-to-end automation across legacy and modern tools alike: navigating UIs, logging in, and sending files. The agents that win will slot in where a human can today, without IT overhauls or custom integrations. Excellent deep dive from @zephratic, @stuffyokodraws, @seema_amble, and @JenniferHli:

113

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

13 Sep 2025

We tested @browser_use on 2 benchmarks across different models on @ParadigmShiftAI: a random sample of 100 tasks and a set of 100 interaction-heavy tasks (multiple episodes). Also compared against Anthropic CUA and OpenAI CUA (accessed via API). The charts clearly show: 🔹GPT-5 and Claude lead on reliability but GPT-5 is about 2x slower than others (not worth it IMO) 🔹Gemini 2.5 Flash/Pro deliver the best speed and price on easier flows 🔹On the hardest interaction-heavy tasks, Claude 4 Sonnet performs the best Maybe routing by task difficulty is the way to go: use a premium model for high-interaction workflows where reliability is non-negotiable, and use Gemini Flash or a smaller model for straightforward tasks where speed and cost matter. All evals ran end-to-end on the @ParadigmShiftAI platform & infra. Want to test your agent with different models & benchmarks? DM me. Full results in the thread ⬇️

122

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

4 Sep 2025

Benchmark hacking is real! An agent can hit “90%” performance on any benchmark by cherry-picking results across runs. In the real world you only get a few tries at a task. Let me show you in action: I ran ~1k tasks x 10 episodes with @browser_use on Gemini 2.5 Flash. Per-episode mean was ~70%, but if you take the best score per task across all 10 episodes it jumps to ~90%. Looks great on a slide, but not realistic. With @ParadigmShiftAI, you can measure what actually matters: reliability across episodes (and a lot more). Same tasks, same infra, same rules. No gaming. Full eval results and methodology in the comments. If you ship agents, this will change how you read “benchmark SOTA”.

Evaluation dashboard comparing per-episode pass rates for ~1k web tasks over 10 episodes using BrowserUse with Gemini 2.5 Flash. Bar chart shows ~70% mean per episode. A separate ‘max across episodes’ line indicates ~90% if cherry-picking best runs, highlighting the gap between pass@1 and best-of-10.
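
The gap between the per-episode mean and the best-of-N number can be reproduced on a toy example. The pass/fail matrix below is made up for illustration and is not the data from the post; the function names `per_episode_mean` and `best_of_n` are likewise illustrative.

```python
from statistics import mean

# Toy pass/fail matrix: one row per task, one column per episode (1 = pass).
# Illustrative data only, not the results from the post.
runs = [
    [1, 1, 1, 1],  # reliable task
    [1, 0, 1, 0],  # flaky task
    [0, 0, 1, 0],  # passes only once
    [0, 0, 0, 0],  # never passes
]


def per_episode_mean(matrix):
    """Average pass rate over all runs (every task has the same episode count)."""
    return mean(mean(row) for row in matrix)


def best_of_n(matrix):
    """Fraction of tasks that pass in at least one episode."""
    return mean(1 if any(row) else 0 for row in matrix)
```

On this matrix the per-episode mean is 0.4375 while best-of-4 is 0.75. Flaky tasks count as full successes under best-of-N, which is the effect the post warns about.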

ALT Evaluation dashboard comparing per-episode pass rates for ~1k web tasks over 10 episodes using BrowserUse with Gemini 2.5 Flash. Bar chart shows ~70% mean per episode. A separate ‘max across episodes’ line indicates ~90% if cherry-picking best runs, highlighting the gap between pass@1 and best-of-10.

129,638

Paradigm Shift AI

@ParadigmShiftAI

24 Jul 2025

Paradigm Shift AI just supercharged web-agent evals 🚀 We revamped our analytics with deeper agent insights, success heatmaps, variance scores, human baselines, full replay & crash logs and more. See where your agent shines or stumbles all in one place. Want access to the platform to test your agent? DM us. Blog ➜ paradigm-shift.ai/blog/agent…

668,349

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

23 Jul 2025

Ran @browser_use on @ParadigmShiftAI to pit Claude 4 Sonnet vs Gemini 2.5 Pro on 10x10 WebVoyager vision tasks. Claude: 99% accuracy & 3× faster ⚡️ Gemini: 75% accuracy 😬 @GoogleDeepMind why the lag? #AI #VisionAI

2,186

Paradigm Shift AI

@ParadigmShiftAI

17 Jul 2025

Track browser-eval progress in real time, episode by episode and right from your dashboard! No more hunting through live logs (unless you still get a kick out of it 😅)

243

Paradigm Shift AI

@ParadigmShiftAI

11 Jul 2025

More news & insights to share soon 🔥

Anaïs Howland

@AnaisHowland18

11 Jul 2025

Ran a web-agent evaluation on 5k tasks in one pass with @ParadigmShiftAI, our biggest batch yet! Planning to 2x capacity each week and aiming for a 100K-task eval in a few weeks. Stay tuned, more insights coming! 🔥
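
As a quick check on the timeline in this post, doubling weekly from a 5k-task batch passes 100K tasks after five doublings. The sketch assumes capacity doubles exactly once per week starting from 5,000; the variable names are illustrative.

```python
import math

start_tasks = 5_000      # batch size reported in the post
target_tasks = 100_000   # stated goal
growth_per_week = 2      # "2x capacity each week"

# Smallest whole number of weeks w with start_tasks * growth_per_week**w >= target_tasks.
weeks_needed = math.ceil(math.log(target_tasks / start_tasks, growth_per_week))

# Capacity at week 0, 1, ..., weeks_needed under exact weekly doubling.
capacity_by_week = [start_tasks * growth_per_week**w for w in range(weeks_needed + 1)]
```

Under that assumption capacity runs 5k, 10k, 20k, 40k, 80k, 160k, so the 100K target is first exceeded in week 5.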

115

Paradigm Shift AI retweeted

Anaïs Howland

@AnaisHowland18

9 Jul 2025

Totally agree, great analysis. That’s why @ParadigmShiftAI delivers richer metrics, deeper failure-trace analytics, and a bigger task bank (proprietary and public) to really stress-test web agents.

Shayne Longpre

@ShayneRedford

8 Jul 2025

Existing AI Agent benchmarks are broken 🤖💔 Great work by @maxYuxuanZhu and @ddkang to identify and fix issues, and establish rigorous best practices for Agentic AI benchmarks! Check out the blog: ddkang.substack.com/p/ai-age…

255

Paradigm Shift AI

@ParadigmShiftAI

7 Jul 2025

Thrilled to announce we've been accepted into the @UofBeta Pre-Acceleration Program Cohort 10! Looking forward to connecting, learning, and growing alongside other incredible founders.

122

Paradigm Shift AI

@ParadigmShiftAI

17 Jun 2025

Introducing NeuroSim, our browser agent evaluation platform! Run real-world evaluations for browser agents and models, see gap-to-human scores, share team leaderboards, free while we iterate with you. Read more 👉 paradigm-shift.ai/blog/neuro… DM or email info@paradigm-shift.ai for access. #AI #LLM #Evals

ALT browser use agent, desktop agent, computer use agent, agent evals, agent evaluations, simulation platform, human-computer interaction data, eval analytics, LLM-as-a-judge, reinforcement learning, RL, LoRA, training, GenAI, AI, AI Agent, Paradigm Shift AI, VMs, virtual machine, OpenAI, Gemini, Anthropic, Claude, episodes, task pipeline, eval metrics

48,281

Paradigm Shift AI

@ParadigmShiftAI

11 Jun 2025

o3 just got 80% cheaper (thanks @OpenAI), so we added it. NeuroSim supports o3: run your browser-use agent evals on Paradigm Shift AI and see how they stack up!

ALT agent evals, evaluation, simulation, openai, o3, o4-mini, browser agent, desktop agent, performance, agent benchmarks

Sam Altman

@sama

10 Jun 2025

we dropped the price of o3 by 80%!! excited to see what people will do with it now. think you'll also be happy with o3-pro pricing for the performance :)

1,313