spent a week benchmarking code review tools on the same dataset (166 golden comments from human reviewers, 20 PRs, 9 languages). some findings:
1. triage > raw LLM. pointing GPT-5.2 at every line of a diff gives you noise. first figure out which entities changed, who depends on them, and how risky the change is, then point the LLM at just the 25 riskiest entities. recall goes up. precision goes up. the graph answers what the LLM can't: "how many things break if this changes?" deterministic, no hallucination, zero token cost.
2. the coverage problem is real. on a 40-file PR, risk-based triage picked entities from src/Mod/Draft/ and src/Mod/BIM/. all 19 human comments were in src/Mod/TechDraw/, which had lower risk scores. a second pass on uncovered files fixed it. recall jumped from 36% to 39%.
3. dedup matters more than prompt engineering. tried 10 prompt variations. the change that actually moved precision from 17% to 25% wasn't a prompt at all, it was identifier-aware dedup: if two findings share at least 2 code identifiers, merge them. boring but effective.
4. more context is not always better. going from 3000 to 5000 chars of context per entity made things worse. the LLM found more things to comment on, most of them wrong.
5. model choice matters less than you'd think. GPT-5.2 was best overall. Sonnet 4.6 was most precise but too conservative. the triage + dedup pipeline matters more than which model sits in the middle.
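
for anyone who wants to play with the idea, here's a rough python sketch of the triage, second-pass and dedup steps from points 1-3. this is not the actual pipeline: every name in it (`dependents`, `triage`, `identifiers`, `dedup`) is made up for illustration, the risk score is just "how many things transitively depend on this entity" as a stand-in for a real score, the second pass adds one entity per uncovered file, and "code identifier" is crudely approximated as any token with an underscore or an internal capital letter.

```python
import re

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def dependents(graph, entity):
    """Count everything that transitively depends on `entity`.
    `graph` maps an entity name to the names that directly depend on it."""
    seen, stack = set(), [entity]
    while stack:
        for dep in graph.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    seen.discard(entity)  # a dependency cycle could add the entity itself
    return len(seen)

def triage(changed, graph, top_k=25):
    """`changed` is a list of (entity_name, file_path) pairs.
    Keep the top_k entities by dependent count (assumed stand-in for risk),
    then do a second pass: one entity from every file that got no pick."""
    ranked = sorted(changed, key=lambda e: dependents(graph, e[0]), reverse=True)
    picked = ranked[:top_k]
    covered = {path for _, path in picked}
    for name, path in ranked[top_k:]:
        if path not in covered:
            picked.append((name, path))
            covered.add(path)
    return picked

def identifiers(text):
    """Crude assumption: a token is a code identifier if it contains an
    underscore or has a capital letter after its first character."""
    return {t for t in TOKEN.findall(text)
            if "_" in t or any(c.isupper() for c in t[1:])}

def dedup(findings, min_shared=2):
    """Greedy single pass: merge a finding into the first group that shares
    at least `min_shared` identifiers with it, else start a new group."""
    groups = []  # each group is (identifier set, list of finding texts)
    for text in findings:
        ids = identifiers(text)
        for group_ids, texts in groups:
            if len(ids & group_ids) >= min_shared:
                group_ids.update(ids)
                texts.append(text)
                break
        else:
            groups.append((set(ids), [text]))
    return [texts for _, texts in groups]

if __name__ == "__main__":
    graph = {"parse": ["load", "run"], "load": ["run"], "fmt": []}
    changed = [("fmt", "a.py"), ("parse", "b.py"), ("load", "b.py")]
    print(triage(changed, graph, top_k=1))
    print(dedup([
        "parseConfig ignores max_retries when unset",
        "max_retries is never read in parseConfig",
        "loadFile leaks a handle",
    ]))
```

the dedup here is greedy and order-dependent, so two groups that only become similar after later merges won't be joined. good enough to show the idea, probably not what you'd ship.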