making AI work

Joined July 2016
We spoke with over 50 healthcare professionals to understand what they need from AI. The conversations came down to one thing: trust. Our latest healthcare findings: labs.scale.com/blog/healthca…
Scale AI retweeted
Today we're releasing HiL-Dynamics, the first open-source tool that measures how production agents actually collaborate with humans under uncertainty. Not just whether they got the answer. Now you can measure exactly when your agent asks for help, when it makes assumptions, and when it'll confidently ship the wrong answer. Our findings 🧵
To understand our story, you have to go back to the beginning. It started with self-driving cars. Ten years later, it's the architecture underneath AI that actually works, across frontier labs, enterprises, governments, and mission-critical systems around the world.
Scale AI retweeted
The humans stay. That’s the idea behind @scale_ai's new brand campaign. 10 years of building AI has taught us something: the most important decisions belong to humans. The AI that works in decisions of consequence keeps humans at the center. Going live in SF and NYC. Where to next? 👀
The future runs on proof. 😤
This month we turn 10. The hard work started in 2016, and it hasn’t stopped. Shortcuts are for losers. Winners welcome. scale.com/careers
Scale AI retweeted
Today we’re releasing Refactoring, the final leaderboard of our SWE Atlas suite. This new leaderboard is the ultimate test of an agent's ability to restructure code without breaking the system. Claude Opus 4.7 with Claude Code takes the top spot 🥇
Proud to share that @CDAODoW has expanded its enterprise agreement with Scale AI, raising the ceiling from $100M to $500M. This expansion reflects our continued commitment to accelerating the adoption of AI capabilities across the Pentagon to help America stay prepared, resilient, and strong. scale.com/blog/Scale-ai-pent…
Scale AI retweeted
AI pretenders vs. AI contenders. It's those who still haven’t realized reliability is the product vs. those who can deliver reliability and outcomes. That's what the enterprise AI race comes down to. Here's a note I sent the Scale team this week.
Scale AI retweeted
We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We just added GPT-5.5, Opus 4.7, and Kimi K2.6 to the leaderboard. Here’s what we’re seeing ⬇️🧵
Scale AI has acquired ICG Solutions, a defense technology firm specializing in real-time streaming data analytics. This is another step forward in how we support the U.S. defense and intelligence community with AI systems built to serve America’s most important national security missions. scale.com/blog/scale-acquire…
New @ScaleAILabs Research: Your AI agent just gave you an answer, but did it actually solve the problem, get lucky, or just sound right? Today’s benchmarks can’t tell. We built HiL-Bench (Human-in-Loop Benchmark) to test a critical skill: does your agent know what it’s missing and when to ask for clarification? 🧵
Key takeaway for model builders: capability and judgment are orthogonal axes. Scaling SWE-Bench scores alone won't close this gap. Current post-training doesn’t penalize an agent for confidently solving the wrong problem. Ask-F1 is the first verifiable signal that does, and it transfers across domains. The goal isn't full autonomy. It's selective escalation: agents that know what they don't know.
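The thread names Ask-F1 but does not define it. As a minimal sketch, assuming Ask-F1 is a standard F1 score over the agent's decision to ask for clarification (positive class: a key detail really was missing from the spec), the metric rewards asking exactly when needed: a confident guess on an underspecified task is a false negative, and an unnecessary question on a complete task is a false positive. The `Episode` type and `ask_f1` function are hypothetical names for illustration, not Scale's published definition or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One task attempt (hypothetical structure, for illustration only)."""
    info_missing: bool  # ground truth: was a key detail withheld from the spec?
    agent_asked: bool   # did the agent request clarification?


def ask_f1(episodes: list[Episode]) -> float:
    """F1 of the ask/don't-ask decision, with 'info missing' as the positive class.

    Assumption: this mirrors the idea behind Ask-F1 (escalate only when needed),
    not an official formula.
    """
    tp = sum(e.agent_asked and e.info_missing for e in episodes)      # needed ask
    fp = sum(e.agent_asked and not e.info_missing for e in episodes)  # needless ask
    fn = sum(not e.agent_asked and e.info_missing for e in episodes)  # confident guess
    if tp == 0:
        return 0.0  # never asked when it mattered
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


runs = [
    Episode(info_missing=True, agent_asked=True),    # correct escalation
    Episode(info_missing=True, agent_asked=False),   # confidently guessed
    Episode(info_missing=False, agent_asked=True),   # unnecessary question
    Episode(info_missing=False, agent_asked=False),  # correct autonomy
]
print(ask_f1(runs))  # → 0.5
```

Under this assumed definition, an agent that never asks scores 0.0 no matter how capable it is, while one that asks on every task is capped by its precision, which matches the "selective escalation" goal rather than full autonomy or constant check-ins.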