Is Falcon really better than LLaMA?
Short take: probably not.
Longer take: we reproduced the LLaMA 65B evaluation on MMLU and got 61.4, close to the official number (63.4), much higher than its Open LLM Leaderboard number (48.8), and clearly higher than Falcon (52.7).
Code and prompt open-sourced at
github.com/FranxYao/chain-of…
No fancy prompt engineering, no fancy decoding, everything by default.
----
Full story:
On the Open LLM Leaderboard (huggingface.co/spaces/Huggin…), Falcon is ranked first, surpassing LLaMA, and was promoted by @Thom_Wolf (twitter.com/Thom_Wolf/status…)
Yet later @karpathy raised a concern: why is the LLaMA 65B score on the Open LLM Leaderboard significantly lower than the official one (48.8 vs. 63.4)? See
twitter.com/karpathy/status/…
We figured that a simple, quick, open-source evaluation script for LLaMA 65B would clarify things, so we wrote one:
github.com/FranxYao/chain-of…
Again, everything is default: the official MMLU prompt, no fancy prompt engineering, no fancy decoding. LLaMA 65B can simply do it. We encourage everyone to try the eval script out.
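For anyone wondering what a "default" MMLU setup looks like in practice, here is a minimal Python sketch of few-shot prompt construction and accuracy scoring. It is not the code from the repo linked above. The function names (`format_example`, `build_prompt`, `accuracy`) are hypothetical, and the header wording is an assumption modeled on the original MMLU release format, so the actual script may differ in details.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Format one multiple-choice item; append the answer only for few-shot demos."""
    lines = [question]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(subject, demos, test_item):
    """Build a few-shot prompt: header, solved demos, then the unanswered test item.

    The header text is an assumption based on the original MMLU release format.
    """
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    shots = "".join(format_example(q, opts, ans) + "\n\n" for q, opts, ans in demos)
    question, options = test_item
    return header + shots + format_example(question, options)


def accuracy(predictions, gold):
    """Fraction of predictions whose first non-space character matches the gold letter."""
    correct = sum(p.strip()[:1].upper() == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

With a model plugged in, each prompt would be sent with default decoding, and the first letter the model produces would be compared against the gold answer. The model call is left out here on purpose, since it depends on the serving setup.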
This result reinforces our belief that the open-source community's best bet for getting close to GPT-3.5 is to do RLHF on LLaMA 65B, per our earlier findings in Chain-of-thought Hub
arxiv.org/abs/2305.17306
Yet we do not intend to start a war between LLaMA and Falcon: both are great open-source models and have made significant contributions to the field! Falcon also has the advantage of a more permissive license, which gives it great potential to be awesome!
🍻🍻
(i've avoided tweeting about falcon so far because of this, not sure about)