Paper shows how to squeeze more skill from small web‑surfing LLMs without burning absurd compute.
Open agents imitate a 70B teacher for a while, then switch to reinforcement learning (RL), and the timing of that switch is the whole game.
The authors trained an 8B student on teacher demos, branched into RL at different checkpoints, and sampled 1,370 configurations to see what sticks.
Branching early, but not immediately, beats pure imitation or pure RL, matching the best imitation score on MiniWoB++ using 55% of the FLOPs.
The same recipe narrows, but does not close, the WorkArena gap, hinting that hard office workflows still need richer data or bigger brains.
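The branching setup above can be sketched as a simple budget split. This is a minimal illustration, not the authors' pipeline: `RunPlan`, `branch_plans`, the step counts, and the branch points are all hypothetical, and the budget is treated as abstract steps rather than measured FLOPs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPlan:
    sft_steps: int  # imitation (teacher-demo) steps before branching
    rl_steps: int   # RL steps spent after branching


def branch_plans(total_steps, branch_points):
    """Build one plan per branch point: imitate up to the checkpoint,
    then spend the remaining budget on RL."""
    plans = []
    for point in branch_points:
        if not 0 <= point <= total_steps:
            raise ValueError(f"branch point {point} is outside the budget")
        plans.append(RunPlan(sft_steps=point, rl_steps=total_steps - point))
    return plans


# Hypothetical budget and checkpoints: 0 is pure RL, 1000 is pure imitation.
plans = branch_plans(1000, [0, 100, 250, 500, 1000])
for plan in plans:
    print(plan)
```

Each plan spends the same total budget, so differences in final score can be attributed to when the switch happens rather than to how much compute was used.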
Bootstrapped statistics point to stable settings: decoding temperature 0.25, zero advantage filtering, grouped advantage, a large batch size of 512, and a modest 1e-6 learning rate.
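One way to read "bootstrapped" here: resample the finished runs with replacement many times and count how often each value of a knob comes out on top. The sketch below uses that generic approach, which may differ from the paper's exact procedure; `bootstrap_win_rate` and the toy scores are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def bootstrap_win_rate(runs, knob, n_boot=1000, seed=0):
    """runs: list of (config dict, score) pairs.
    Returns, for each value of `knob`, the fraction of bootstrap resamples
    in which that value has the highest mean score."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        sample = rng.choices(runs, k=len(runs))  # resample with replacement
        by_value = defaultdict(list)
        for config, score in sample:
            by_value[config[knob]].append(score)
        best = max(by_value, key=lambda v: sum(by_value[v]) / len(by_value[v]))
        wins[best] += 1
    values = sorted({config[knob] for config, _ in runs})
    return {v: wins[v] / n_boot for v in values}


# Toy, invented scores (not results from the paper).
runs = [
    ({"temperature": 0.25}, 0.60),
    ({"temperature": 0.25}, 0.62),
    ({"temperature": 0.25}, 0.58),
    ({"temperature": 0.75}, 0.30),
    ({"temperature": 0.75}, 0.35),
    ({"temperature": 0.75}, 0.32),
]
rates = bootstrap_win_rate(runs, "temperature")
print(rates)
```

A knob counts as stable when one value wins in nearly all resamples; values that trade places from resample to resample are not worth tuning hard.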
The playbook gives smaller teams a clear, cheaper path to teach open models reliable multi‑step browser habits.
----
Paper – arxiv.org/abs/2507.04103
Paper Title: "How to Train Your LLM Web Agent: A Statistical Diagnosis"