ChatGPT entering its toddler phase
Yes, GPT-4 seems to be getting worse.
But now we have new information. And well, it's complicated.
Yesterday, I posted about a study showing that GPT-4's success rate at deciding whether a number is prime went from 97.6% in March to 2.4% in June.
The report also showed how the model ignored requests to follow step-by-step reasoning, and it was less likely to generate code that ran without modifications.
Hundreds of people replied with their anecdotes. The overwhelming consensus is that GPT-4 is considerably less capable than before.
But the study that started the conversation is misleading.
They used a dataset of 500 problems and had the model figure out whether a given number was prime. The latest GPT-4 version did much worse than the one from a few months ago, with only 12 correct answers out of 500.
But there was an issue:
Every one of the 500 integers used in the study was a prime number! They never tested composite numbers.
So what happens when you make the same comparison with composite and prime numbers?
It turns out that March's GPT-4 is as bad as the June version! In March, GPT-4 answered that most numbers were prime, while the June version answered that most were composite. Since the team behind the study only tested prime numbers, they concluded that GPT-4 is now much worse at determining primality, but that's not the case.
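Here's a minimal sketch of why a prime-only test set is so misleading. It assumes two deliberately extreme stand-ins for the model: one that always answers "prime" and one that always answers "composite". The names (is_prime, always_prime, always_composite, accuracy) and the number range are hypothetical, chosen for illustration; this is not the study's code, and the real GPT-4 versions leaned toward one answer rather than giving it every time.

```python
def is_prime(n: int) -> bool:
    """Ground truth via trial division; fine for small illustrative numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


# Hypothetical stand-ins for a model biased toward a single answer.
def always_prime(n: int) -> bool:
    return True


def always_composite(n: int) -> bool:
    return False


def accuracy(guesser, numbers) -> float:
    """Fraction of numbers for which the guesser matches the ground truth."""
    correct = sum(guesser(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


# Assumed range, picked only to have enough primes and composites.
candidates = range(1000, 2000)
primes_only = [n for n in candidates if is_prime(n)]
composites = [n for n in candidates if not is_prime(n)]

# Balanced set: every prime plus an equal number of composites.
balanced = primes_only + composites[:len(primes_only)]

for name, guesser in [("always prime", always_prime),
                      ("always composite", always_composite)]:
    print(name,
          "| primes only:", accuracy(guesser, primes_only),
          "| balanced:", accuracy(guesser, balanced))
```

On the prime-only set, one guesser looks perfect and the other looks broken, even though neither is checking anything. On the balanced set, both land at 50%, which is no better than a coin flip.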
Okay, so where do we stand?
Funnily enough, the apparent conclusion is that GPT-4 sucks at determining whether a number is prime. It didn't get worse; it was never good at it.
There's still, however, a large unanswered issue: developers can't trust these models to behave consistently. We still don't know what caused the sudden change in behavior between March and June, since OpenAI has firmly denied changing the model.
What's next?
OpenAI acknowledged the behavior change, and they are investigating. I hope they publish an explanation for the drift. I'm also looking forward to a proper versioning system that developers can trust and rely on.
This finding doesn't change the overall sentiment from people who overwhelmingly believe the model has worsened. Could this be confirmation bias? Could the honeymoon phase with Large Language Models be over, with people now finding the real problems as they build actual applications?
What do you think is going on here?