We benchmarked Opus 4.5, Sonnet 4.5, and Gemini 3 Pro on research tasks at Elicit - extracting answers from papers and writing systematic review reports. Results were pretty clear:
*QA from papers:* Opus 4.5 dominates: 96.5% accuracy vs Gemini's 89.4%. Opus is also best on our combined "accurate supported direct" metric (76% vs 71%). Gemini is slightly better on claim supportedness alone.
*Report writing:* Opus 4.5 produces significantly better-supported reports than Sonnet 4.5, the previous best model for this task:
- 62% of claims well-supported vs Sonnet's 54%
- 31% poorly-supported vs Sonnet's 40%
Opus is also less verbose, writing ~20% fewer claims per report. We didn't bother comparing to Gemini on this task, since Sonnet 4.5 already wins 75% of head-to-head comparisons vs Gemini, and Gemini is 6x slower than Sonnet.
Qualitatively, in a manual screen of 5 reports, @PradyuPrasad found that Opus and Sonnet reach the same conclusions with no dramatic differences in output. Sonnet just writes much longer reports with more extensive commentary by default.
Opus still has stability issues at scale - we hit a bunch of 529 (overloaded) errors during testing. But once reliability improves, Opus 4.5 looks like the new default for accuracy-critical research workflows.
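
For anyone hitting the same 529s: a common workaround is to retry with exponential backoff and jitter. The sketch below is a generic illustration, not what we ran at Elicit. `OverloadedError` and `call_with_backoff` are hypothetical names, and in real code you would catch whatever exception your API client raises for a 529 response.

```python
import random
import time


class OverloadedError(Exception):
    """Hypothetical stand-in for an HTTP 529 'overloaded' response."""
    status_code = 529


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on OverloadedError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except OverloadedError:
            if attempt == max_retries:
                raise  # out of retries; surface the error to the caller
            # Base delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
```

Wrapping each model call as `call_with_backoff(lambda: run_one_task(task))` (where `run_one_task` is a placeholder for your own request function) lets a large benchmark run ride out short overload windows instead of failing outright.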