🚨 Preprint Alert! 🚨
It's 12 hours before your conference deadline. Tick, tock. ⏰
You've left it to the last minute, as usual, and still need to write code for some fancy plots.
You counted on your coding assistant to do the heavy lifting, but it's not version-aware. 🤖❌
You keep hitting one matplotlib error after another.
Tick, tock. Panic sets in. 😱
Introducing GitChameleon 🦎
Our new benchmark tests large language models (LLMs) on their ability to generate version-specific code.
We curated 116 Python code completion problems, each tied to specific library versions, complete with executable unit tests.
Why Does Version Awareness Matter?
LLMs are great at generating code, but they often fail when library versions change. The result is non-functional code that wastes precious time, especially when deadlines loom!
The Challenge:
Software libraries evolve rapidly: Matplotlib, NumPy, PyTorch, you name it. If your code assistant isn't aware of version-specific changes, you could be in for a world of debugging pain. 😩
What GitChameleon Brings to the Table:
* Version-Specific Problems: Focuses on real-world issues like deprecated functions and API updates.
* Execution-Based Evaluation: Goes beyond static code analysis to test actual functionality.
* Popular Libraries Covered: Matplotlib, NumPy, PyTorch, Pandas, and more.
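Execution-based evaluation boils down to running the generated code together with its unit test and checking whether it passes. Here is a minimal sketch of that idea, not GitChameleon's actual harness: the helper name `passes_unit_test` and the toy `add` problem are made up for illustration, and a real setup would presumably run each problem in an environment with the pinned library version installed instead of the current interpreter.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_unit_test(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if candidate_code followed by test_code exits with code 0.

    Hypothetical helper: it uses the current interpreter, whereas a
    version-aware harness would point at an environment that has the
    library version required by the problem.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat a hanging solution as a failure.
            return False
    return result.returncode == 0


# Toy problem (made up): one correct and one incorrect completion.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
test = "assert add(2, 3) == 5"

print(passes_unit_test(good, test), passes_unit_test(bad, test))
```

A failed assertion makes the child process exit with a non-zero code, so the check needs no parsing of the output.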
Key Findings:
We tested state-of-the-art LLMs, including GPT-4o, Gemini, and DeepSeekCoder v2.
* Performance Was Underwhelming: GPT-4o achieved a pass@10 of only 39.9%.
* Error Feedback Helps Slightly: With error feedback, GPT-4o improved to 43.7%.
* Low Correlation with Other Benchmarks: GitChameleon results show low correlation with representative code benchmarks. The Spearman correlation coefficients with HumanEval, EvalPlus, and the BigCodeBench-Hard split were 0.37, 0.50, and 0.36, respectively. This highlights the unique challenges of version-specific code generation.
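For readers unfamiliar with pass@10: pass@k is the probability that at least one of k sampled completions passes the unit tests. The post does not say how the numbers above were computed, so the sketch below is an assumption. It uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n completions are sampled per problem and c of them pass.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: number that passed, k: budget.
    Assumed estimator, not confirmed as the one used for the figures above.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 20 samples, exactly 1 correct: half of all size-10 subsets contain it.
print(pass_at_k(n=20, c=1, k=10))  # 0.5
```

The benchmark-level score would then be the mean of this value over all problems.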
Types of Version Changes Tested:
* Function Name Changes
* Argument/Attribute Changes
* Semantic/Behavioral Changes (avg pass@10: ~9.3% 😱)
* New Features/Dependencies
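To make the first category concrete, here is a toy illustration of a function name change. Everything in it is hypothetical: `toylib` is a made-up library, simulated with two stand-in objects so the snippet runs without installing anything, and `multiply_all` is an illustrative name. It is not a problem from the benchmark.

```python
from math import prod as _prod
from types import SimpleNamespace

# Stand-ins for two releases of a made-up library, "toylib":
# release 1.x exposes product(), release 2.x renamed it to prod().
toylib_v1 = SimpleNamespace(__version__="1.4.0", product=lambda xs: _prod(xs))
toylib_v2 = SimpleNamespace(__version__="2.0.0", prod=lambda xs: _prod(xs))


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse 'major.minor.patch' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in version.split(".")[:2])


def multiply_all(lib, values):
    """Version-aware call: pick the name that exists in the given release."""
    if version_tuple(lib.__version__) >= (2, 0):
        return lib.prod(values)
    return lib.product(values)


print(multiply_all(toylib_v1, [2, 3, 4]), multiply_all(toylib_v2, [2, 3, 4]))  # 24 24
```

Code that hard-codes `lib.product(...)` works against the 1.x stand-in and raises `AttributeError` against the 2.x one, which is the kind of failure a version-unaware assistant produces.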
Paper:
huggingface.co/papers/2411.0…
Code:
github.com/NizarIslah/GitCha…
Thanks to first authors @nizar_islah and Justine G, and to @irinarish, @NeuralEnsemble, @terryyuezhuo, @ServiceNowResearch, and @MILA
(yes, I did pay $3.75 to write a long post)