New evals are badly needed for next-gen LLMs!
New paper! SimpleQA is a newly open-sourced factuality benchmark that contains 4,326 short, fact-seeking questions that are challenging for frontier models.
Designing good evals is hard, so we built SimpleQA around the following criteria:
- High correctness via robust data quality verification / human agreement rates.
- Good researcher UX. Easy to grade, easy to run.
- Challenging for frontier models. GPT-4o and Claude both score less than 50%.
- Diversity. SimpleQA contains questions from a wide range of topics, including history, science & technology, art, geography, and TV shows.