We're hitting the point where LLM evals are going to have to be task-based. Models are just too advanced for single-prompt, text-based evaluations.
The Minecraft evals are a good example. I want to see LLMs building apps, creating art, completing office work, controlling robots, training dogs, lol, etc.