We should formalize a new benchmark for evaluating LLMs: 20Questions.
Maybe the metric would be the number of questions an LLM needs to reach the correct answer, averaged over the dataset (lower is better).
It's a decent task for evaluating logical deduction, reasoning, and creativity.
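To make the proposed metric concrete, here is a rough sketch of how the averaged question count might be computed. Everything in it is an assumption for illustration, not a settled design: the names `play_game` and `score_benchmark`, the `ask`/`answer` callables standing in for the LLM and the answer oracle, the 20-question budget, and especially the choice to count an unsolved game as `max_questions + 1`.

```python
from statistics import mean

MAX_QUESTIONS = 20  # assumption: the classic 20-question budget


def play_game(ask, answer, max_questions=MAX_QUESTIONS):
    """Play one game and return how many questions were used.

    `ask` (hypothetical) maps the history so far to the next question.
    `answer` (hypothetical) maps a question to a reply, and returns the
    string "correct" once the target has been identified.
    Returns None if the budget runs out first.
    """
    history = []
    for turn in range(1, max_questions + 1):
        question = ask(history)
        reply = answer(question)
        history.append((question, reply))
        if reply == "correct":
            return turn
    return None


def score_benchmark(results, max_questions=MAX_QUESTIONS):
    """Aggregate per-game question counts into benchmark-level numbers.

    Assumption: an unsolved game (None) is counted as max_questions + 1,
    so failing is always worse than solving on the last question.
    """
    if not results:
        raise ValueError("results must not be empty")
    penalized = [r if r is not None else max_questions + 1 for r in results]
    solved = sum(r is not None for r in results)
    return {
        "mean_questions": mean(penalized),
        "solve_rate": solved / len(results),
    }
```

Reporting the solve rate alongside the mean keeps the failure penalty from hiding how often the model never finds the answer at all.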