1/5 The recent LMSYS study on chatbot rankings shows why we shouldn't rely too heavily on any single metric for language models. Rankings shifted significantly once the analysis controlled for response "style" (length, formatting) to isolate "substance".
Does style matter more than substance in Arena? Can models "game" human preference with lengthy, well-formatted responses?
Today, we're launching style control in our regression model for Chatbot Arena — our first step in separating the impact of style from substance in rankings.
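For readers wondering what "style control in a regression model" can look like, here is a minimal sketch. It assumes a Bradley-Terry-style logistic regression over pairwise battles with one extra covariate for the style difference between the two responses. The function name `fit_bradley_terry`, the models `verbose` and `concise`, the single `style_diff` feature, and the toy battles are all hypothetical and made up for illustration; this is not the Arena code or data.

```python
import math

def fit_bradley_terry(battles, models, use_style, steps=2000, lr=0.5, l2=0.01):
    """Fit log-strengths by gradient ascent on an L2-regularized logistic likelihood.

    battles: list of (model_a, model_b, style_diff, a_wins), with a_wins in {0, 1}.
    style_diff is a hypothetical normalized style gap (e.g. length of A minus B).
    Returns ({model: strength}, style_coefficient).
    """
    index = {m: i for i, m in enumerate(models)}
    n = len(models)
    w = [0.0] * (n + 1)  # n model strengths, then one style coefficient
    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for a, b, style_diff, a_wins in battles:
            s = style_diff if use_style else 0.0  # dropping the covariate = no style control
            logit = w[index[a]] - w[index[b]] + w[n] * s
            p_a = 1.0 / (1.0 + math.exp(-logit))  # predicted P(model_a wins)
            err = a_wins - p_a
            grad[index[a]] += err
            grad[index[b]] -= err
            grad[n] += err * s
        for j in range(n + 1):
            w[j] += lr * (grad[j] / len(battles) - l2 * w[j])
    return {m: w[index[m]] for m in models}, w[n]

# Toy data (invented): the longer response wins 75% of the time regardless of
# which model wrote it, and "verbose" is the longer one in 8 of 12 battles.
battles = (
    [("verbose", "concise", +1.0, 1)] * 6
    + [("verbose", "concise", +1.0, 0)] * 2
    + [("verbose", "concise", -1.0, 1)] * 1
    + [("verbose", "concise", -1.0, 0)] * 3
)
models = ["verbose", "concise"]

raw, _ = fit_bradley_terry(battles, models, use_style=False)
controlled, style_coef = fit_bradley_terry(battles, models, use_style=True)

raw_gap = raw["verbose"] - raw["concise"]
controlled_gap = controlled["verbose"] - controlled["concise"]
print(f"raw gap: {raw_gap:.3f}")
print(f"style-controlled gap: {controlled_gap:.3f}")
print(f"style coefficient: {style_coef:.3f}")
```

In this toy setup the raw fit credits `verbose` with a clear lead, while the style-controlled fit moves most of that lead into the style coefficient and leaves the two models nearly tied. That is the kind of ranking shift described in the highlights.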
Highlights:
- GPT-4o-mini, Grok-2-mini drop below most frontier models when style is controlled
- Claude 3.5 Sonnet, Opus, and Llama-3.1-405B rise significantly
- In Hard Prompts, Claude 3.5 Sonnet ties for #1 with ChatGPT-4o-latest. Llama-3.1-405B climbs to joint #3.
More analysis in the thread below👇