Traditionally in ML, building models is the central activity and evaluation is a bit of an afterthought. But the story of ML over the last decade is that models have become more general-purpose and more capable. General-purpose means you build once but have to evaluate everywhere. Increasing capability means taking on more realistic tasks in higher-stakes domains, so benchmarks have to be far more complex and thoughtful, and in many cases even the most careful benchmarks simply aren't enough. So both the quantity and the quality of evaluations have to increase.
But status hierarchies change slowly in any research field, including ML. Most researchers' dream is to build the next transformer. That's a lottery with incredibly low odds. I suspect that researchers who focus on evaluation and understanding will have a much easier time making impactful contributions and standing out, despite the traditionally lower status of this type of work. (Obviously oversimplifying a bit to make a point; there's a lot more to ML than model building and evaluation.)