When things aren’t working, or when the model doesn’t reach 100% of the intended behavior.
Most people don’t think of running evals, which burn tokens too. Most aren’t even prepared to build one, and they don’t know how to verify that an eval measures what it is supposed to.
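For a sense of scale, a bare-bones eval can be tiny. The sketch below is only an illustration: `EvalCase`, `run_eval`, and `fake_model` are made-up names, the scoring is plain exact match, and `fake_model` stands in for whatever real model call you would make (which is where the tokens get burned).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases where the model output matches exactly."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; swap in your own client here.
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "I don't know")


CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),  # fake_model has no answer for this one
]

score = run_eval(fake_model, CASES)
print(f"pass rate: {score:.0%}")
```

One cheap way to verify the eval itself is to run it against a stand-in that is known to be always right and another that is known to be always wrong. If those don't come out at 100% and 0%, the problem is the eval, not the model.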
Then, typically, once the model reaches 100% of its target on a particular task, most people will push for more: 110%, 120%. The incremental improvements between those can’t be explained if you don’t understand how transformers work, or whatever architecture is behind the model (which is closed weights anyway).
If you have open weights, you can demonstrate this at a lower level. But what most people see is magic tokens being chained around tool calls.