DPO pushed baguettotron so far into unreadable experimental land that I didn't like it
however skipping straight from SFT to GRPO is producing moments that make me forget that this model is only 371M params
GRPO w mostly format reward (</think>, title, length), a huge repetition penalty, and 20% aesthetic reward from the aforementioned reward model
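rough sketch of how a reward mix like this could be wired up. everything here is an assumption for illustration: the function names, the `#` title convention, the word-count bounds, the trigram repetition measure, and the `rep_scale` value are all hypothetical, not the actual run config. only the general shape (mostly format, 20% aesthetic, heavy repetition penalty) comes from the note above.

```python
from collections import Counter

def format_reward(text, min_words=20, max_words=300):
    """Score in [0, 1]: exactly one </think>, a title line, poem length in range."""
    has_close = text.count("</think>") == 1
    answer = text.split("</think>", 1)[1].strip() if has_close else ""
    lines = [ln for ln in answer.splitlines() if ln.strip()]
    # Assumption: the title is the first non-empty line after the trace, marked with '#'.
    has_title = bool(lines) and lines[0].startswith("#")
    n_words = len(" ".join(lines[1:]).split())
    checks = [has_close, has_title, min_words <= n_words <= max_words]
    return sum(checks) / len(checks)

def repetition_penalty(text, n=3):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 = no repetition)."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c - 1 for c in counts.values()) / len(grams)

def total_reward(text, aesthetic_score, w_aesthetic=0.2, rep_scale=2.0):
    """Blend format and aesthetic reward, then subtract a scaled repetition penalty.

    aesthetic_score is assumed to be the reward model's output squashed into [0, 1].
    rep_scale is a hypothetical knob standing in for the "huge" penalty.
    """
    base = (1 - w_aesthetic) * format_reward(text) + w_aesthetic * aesthetic_score
    return base - rep_scale * repetition_penalty(text)
```

with `rep_scale` above 1, a heavily looping completion can go negative even with perfect formatting, which is the point of making the penalty large.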
baguettotron poetry llm experiments complete and to come:
- train baguettotron bradley-terry reward model on 10k kimi vs gemma 3n poems (failed, look at data, reward hacking formatting quirks)
- sft baguettotron on 10k kimi poems and reverse-engineered SYNTH reasoning traces (worked)
- train baguettotron RM on 38k preference pairs (13k poems ranked by claude agents to match personal aesthetic) (good enough?: 82% accuracy on validation set)
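for reference, a minimal sketch of the bradley-terry objective and the pairwise accuracy metric behind a number like that 82%. this is the textbook form, not necessarily what the actual training code does; `bt_loss` and `pairwise_accuracy` are hypothetical helper names, and the scores are plain floats standing in for reward model outputs.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # Equivalent to softplus(-margin), written to stay stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def pairwise_accuracy(pairs):
    """Share of (chosen_score, rejected_score) pairs where chosen scores higher."""
    return sum(c > r for c, r in pairs) / len(pairs)
```

a model that scores both poems equally sits at a loss of log 2 per pair and 0% accuracy under the strict `>` comparison, so ties count as misses here.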
up next:
- redo SFT run with 8k-9k poems instead of 10k
- use 1-2k kimi poems and traces in a DPO run against broken baguettotron outputs to stabilize reasoning format and length (inspired by olmo 3)
- GRPO with the preference RM and a few other verifiables (trace format, length, repetition penalty)
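for the GRPO step, a small sketch of the group-relative advantage that the combined reward would feed into, assuming the commonly described formulation (normalize each completion's reward by the mean and std of its own prompt group). `group_advantages` and `eps` are hypothetical names, and real trainers differ in details like which std estimator they use.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards for completions sampled from the same prompt.

    Each completion's advantage is its reward minus the group mean,
    divided by the group (population) std plus eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

if every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason to mix the RM score with verifiable checks that actually vary across samples.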
stretch:
- once formatting / reasoning is stabilized, experiment with self-play reward loop
- ablations (money / time permitting)