Our method's effectiveness and efficiency relies on learning, i.e. internalizing lessons from experience into the model, not only iterating on model output artifacts such as code.
In fact, differently from many recent approaches, we do not use any code execution at all.
Learning in this case where we don't have access to the model's internals, and where the horizon is relatively short, is achieved by simply conditioning the model on the history of its past attempts and their outcomes.
Learning and reasoning is grounded on the real task environment: hypotheses are tested on the grid, and feedback comes from the world, not just from the model's own reasoning trace.
The core of our approach is a simple ReAct loop: We separate reasoning/planning from validating/acting: a dedicated Reasoner learns from full history to improve its natural-language instructions which it provides to a dedicated Validator that executes, validates, and returns feedback to the Reasoner.
The image shows an example problem along with the natural language solution that the Reasoner hypothesized and provided to the Validator. The Reasoner describes it as "Lasers" that "shoot inwards"
The meta-cognitive capabilities of the new Gemini model are also critical as they help decide when the (learning) process can be stopped.