Step 5: Use a strong LLM, prompted with few-shot examples drawn from validated data, to normalize traces into the training format: goal, state, recent trajectory, decision point, chosen and rejected actions with rationale, expected result, outcome label, and suggested next decision. Spot-check a small random sample of normalized traces before including them. Store both SFT examples and preference pairs.
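As a rough illustration of Step 5, the sketch below shows one possible record layout for a normalized trace and how a single record could yield both an SFT example and a preference pair. The class name, field names, and prompt layout are assumptions made for this example, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NormalizedTrace:
    """One normalized decision trace (hypothetical field names)."""
    goal: str
    state: str
    recent_trajectory: List[str]
    decision_point: str
    chosen_action: str
    rejected_action: Optional[str]  # None when no alternative was recorded
    rationale: str
    expected_result: str
    outcome_label: str
    suggested_next_decision: str


def build_prompt(trace: NormalizedTrace) -> str:
    """Render the context the model sees at the decision point."""
    steps = "\n".join(f"- {s}" for s in trace.recent_trajectory)
    return (
        f"Goal: {trace.goal}\n"
        f"State: {trace.state}\n"
        f"Recent trajectory:\n{steps}\n"
        f"Decision point: {trace.decision_point}"
    )


def to_sft_example(trace: NormalizedTrace) -> dict:
    """SFT example: the context as prompt, the chosen action as target."""
    return {
        "prompt": build_prompt(trace),
        "completion": f"{trace.chosen_action}\nRationale: {trace.rationale}",
    }


def to_preference_pair(trace: NormalizedTrace) -> Optional[dict]:
    """Preference pair, or None if the trace has no rejected action."""
    if trace.rejected_action is None:
        return None
    return {
        "prompt": build_prompt(trace),
        "chosen": trace.chosen_action,
        "rejected": trace.rejected_action,
    }
```

Keeping one record type and deriving both training formats from it means a spot-check of the normalized record covers the SFT and preference data at once.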
Step 6: Hold out a stratified set of decision traces for offline mid-task judgment evals. Additionally maintain a small set of live interactive tasks for periodic human preference comparisons.
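The stratified holdout in Step 6 could be sketched as follows. The plan does not say which attribute to stratify on, so this example assumes a caller-supplied function that maps each trace to a stratum key (for instance a difficulty tier); the function name, the default fraction, and the rule of holding out at least one trace per multi-item stratum are all illustrative choices.

```python
import random
from collections import defaultdict


def stratified_holdout(traces, stratum_of, holdout_fraction=0.1, seed=0):
    """Split traces into (train, holdout), sampling within each stratum.

    stratum_of is a hypothetical callable returning a sortable key,
    such as a difficulty tier string.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_stratum = defaultdict(list)
    for trace in traces:
        by_stratum[stratum_of(trace)].append(trace)

    train, holdout = [], []
    for key in sorted(by_stratum):  # sorted for a deterministic order
        group = list(by_stratum[key])
        rng.shuffle(group)
        if len(group) > 1:
            # Assumption: hold out at least one trace per stratum.
            n_holdout = max(1, round(len(group) * holdout_fraction))
        else:
            n_holdout = 0  # a singleton stratum stays in training
        holdout.extend(group[:n_holdout])
        train.extend(group[n_holdout:])
    return train, holdout
```

Fixing the seed and persisting the resulting holdout IDs helps keep eval traces out of later training rounds as new data arrives.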
Step 7: Fine-tune or preference-tune on the curated data, emphasizing mid-trajectory decisions under partial information.
Step 8: Measure progress by intervention rate (overall and per task-difficulty tier, where lower is better), decision-eval performance, end-to-end task quality, and average expert review time per intervention. Keep collecting data with the improved agent and repeat the cycle.
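Two of the Step 8 metrics can be computed from simple episode logs, as in the sketch below. It assumes that "intervention rate" means the fraction of episodes with at least one expert intervention and that each episode records its tier plus a list of review durations in seconds; both the definition and the record layout are assumptions for illustration.

```python
from collections import defaultdict


def intervention_metrics(episodes):
    """Return (overall_rate, rate_per_tier, avg_review_seconds).

    Each episode is a dict with hypothetical keys:
      "tier": difficulty tier label
      "review_seconds": one duration per expert intervention
    """
    counts = defaultdict(lambda: {"episodes": 0, "intervened": 0})
    total_interventions = 0
    total_review_seconds = 0.0

    for ep in episodes:
        reviews = ep["review_seconds"]
        tier_counts = counts[ep["tier"]]
        tier_counts["episodes"] += 1
        if reviews:  # at least one intervention in this episode
            tier_counts["intervened"] += 1
        total_interventions += len(reviews)
        total_review_seconds += sum(reviews)

    n_episodes = sum(c["episodes"] for c in counts.values())
    n_intervened = sum(c["intervened"] for c in counts.values())
    overall = n_intervened / n_episodes if n_episodes else 0.0
    per_tier = {
        tier: c["intervened"] / c["episodes"] for tier, c in counts.items()
    }
    avg_review = (
        total_review_seconds / total_interventions
        if total_interventions
        else 0.0
    )
    return overall, per_tier, avg_review
```

Reporting the per-tier rates alongside the overall rate guards against an apparent improvement that comes only from a shift toward easier tasks.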