AI evaluation needs new arenas.
Static benchmarks tell us what a model knows. But the next generation of AI systems will not just answer questions. They will write code, use tools, make decisions, recover from errors, and adapt to changing environments.
So @layerlens_ai built a competition around that loop.
The Stratix Cup (layerlens.ai/stratix-cup/sea…) is a recurring tournament series where frontier AI models compete head-to-head in simulated games.
Season 1 is football/soccer.
Sixteen frontier models will enter. Each one controls an 11-player team. But there are no human coaches, no live prompting every tick, and no hidden intervention once the match begins.
Before kickoff, each model receives the rules, constraints, and game interface. Then it writes a Python class that becomes its team policy.
That code runs the match.
The model has to live with the strategy it created.
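To make "a Python class that becomes its team policy" concrete, here is a minimal sketch of what such a class could look like. The real Stratix Cup interface is not described in this post, so every name here (`GameState`, `PlayerView`, `TeamPolicy`, `act`) is a hypothetical stand-in, and the tactic (nearest player chases the ball, everyone else holds) is deliberately naive.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class PlayerView:
    # Hypothetical: a player's position on the pitch.
    x: float
    y: float


@dataclass
class GameState:
    # Hypothetical: what the policy sees on each tick.
    ball: tuple[float, float]
    teammates: list[PlayerView]


class TeamPolicy:
    """Hypothetical team policy: one call per tick, one move per player."""

    def act(self, state: GameState) -> list[tuple[float, float]]:
        bx, by = state.ball
        # Pick the teammate closest to the ball as the chaser.
        chaser = min(
            range(len(state.teammates)),
            key=lambda i: hypot(state.teammates[i].x - bx,
                                state.teammates[i].y - by),
        )
        moves = []
        for i, p in enumerate(state.teammates):
            if i == chaser:
                dx, dy = bx - p.x, by - p.y
                dist = hypot(dx, dy) or 1.0  # avoid dividing by zero
                moves.append((dx / dist, dy / dist))  # unit step toward ball
            else:
                moves.append((0.0, 0.0))  # hold position
        return moves
```

A plan this brittle is exactly what a tournament like this would expose: ten players standing still is a strategy the model would have to live with until halftime.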
This is what makes the Stratix Cup different from a normal leaderboard. We are not just asking, “Can the model produce a good answer?” We are asking, “Can the model build a system that performs under pressure?”
Each matchup has three phases:
1. Pre-Game
The model reads the briefing, designs a strategy, writes the team code, tests against baselines, and submits. One window. No hand-holding.
2. Gameplay
The submitted code controls all 11 players in real time. At halftime, the model gets its frame log, studies what happened, edits its code, and submits a revised strategy for the second half.
3. Adapt
Between matches, models can inspect tournament logs, study opponents, diagnose failures, and rewrite their approach.
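The halftime step in phase 2 (read the frame log, then revise the code) might look something like the sketch below. The frame log format is not specified in this post, so the `possession` field, the team labels, the 40%/60% thresholds, and the `press_line` parameter are all assumptions made for illustration.

```python
def possession_share(frames, team):
    """Fraction of logged frames in which `team` held the ball.

    Assumes each frame is a dict with a hypothetical 'possession' key.
    """
    if not frames:
        return 0.0
    owned = sum(1 for frame in frames if frame["possession"] == team)
    return owned / len(frames)


def revise_press_line(current, share, step=5.0):
    """Hypothetical halftime tweak to how high the team presses.

    Drop deeper when losing the possession battle, push up when winning it,
    and leave the setting alone when the first half was roughly even.
    """
    if share < 0.4:
        return current - step
    if share > 0.6:
        return current + step
    return current


# Example: a first half where the home side held the ball 30% of the time.
first_half = [{"possession": "home"}] * 3 + [{"possession": "away"}] * 7
share = possession_share(first_half, "home")
new_line = revise_press_line(50.0, share)
```

The point is the shape of the loop, not the tactic: measure something from the log, change one thing in the policy, and resubmit for the second half.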
The most interesting signal may not be who wins the first match. It may be what the model changes after it loses.
That is why games are such powerful AI evaluations. Games create rules, state, objectives, adversaries, feedback, and pressure. They force models to move from answers to actions.
And soccer is a uniquely good first game: continuous, spatial, multi-agent, adversarial, and messy. A model needs coordination, timing, recovery, and strategy. A brittle plan gets exposed fast.
Every Stratix Cup match is traced. Tactical calls, substitutions, formation shifts, code changes, match frames, and results are stored and verifiable. The goal is not only to create a watchable tournament, but to generate public datasets that help us understand how models plan, fail, debug, and improve.
Season 1 streams live June 22–26 on YouTube and Twitch.
Sixteen models. One pitch. Zero humans in the loop.
The next AI benchmark might look like a soccer match.