OpenCompass

@OpenCompassX

4 Mar 2025

🥳 #CodeCriticBench assesses LLMs' critiquing ability in code generation and code QA tasks. It covers 10 criteria and features a 4.3k-sample dataset with three difficulty levels and a balanced distribution. 😉 CodeCriticBench is now part of the #CompassHub! 😚 Feel free to download and explore it. 🤗 hub.opencompass.org.cn/datas…

Felix @Togacitygrinds

26 Feb 2025

Replying to @ChenchenZhang_1

Cool benchmarks, but real-world code isn’t a leaderboard. CodeCriticBench sounds solid for stress-testing AI, but don’t forget human review still catches what LLMs miss. Also, try Grok 3, Qodo AI, and Tabnine, because AI critiques are only as good as the models behind them.

Chenchen Zhang @ChenchenZhang_1

26 Feb 2025

💥 CodeCriticBench: The Ultimate LLM Code Critique Test!
🚀 Tests code gen & QA (CodeForces, MBPP, StackOverflow)
✔️ 10 fine-grained criteria for deep evaluation
🤖 Auto human review
💡 Helps find stronger Critic models, boosting CODE-RL-Scaling & overcoming sandbox limits! 🚀

AI Native Foundation

@AINativeF

26 Feb 2025

7. CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
🔑 Keywords: Large Language Models, code critique benchmark, reasoning abilities, CodeCriticBench, evaluation protocols
💡 Category: Natural Language Processing
🌟 Research Objective: To evaluate and improve the critique capacity of Large Language Models (LLMs) through a new benchmark, addressing limitations in existing critique benchmarks.
🛠️ Research Methods: Development of CodeCriticBench, which includes comprehensive assessments for code generation and code QA tasks of varying difficulty, supported by detailed evaluation protocols.
💬 Research Conclusions:
- Existing benchmarks are limited by focusing primarily on general reasoning tasks and not adequately addressing code tasks.
- CodeCriticBench offers a holistic evaluation through basic and advanced critique methods, with fine-grained checklists for in-depth analysis.
- Experimental results demonstrate the effectiveness of CodeCriticBench in enhancing LLM critique capabilities.
👉 Paper link: huggingface.co/papers/2502.1…

AI Native Foundation

@AINativeF

26 Feb 2025

📚 AI Native Daily Paper Digest - 20250225
🌟 Follow @AINativeF for the latest insights on AI Native. Covering AI research papers from Hugging Face, featured in the image.
💡 Stay updated with the latest research trends and dive deep into the future of AI! 🚀
#AI #HuggingFace #AIPaper #AINative #AINF

Appendix: Today's AI research papers
1. Thus Spake Long-Context Large Language Model
2. VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
3. DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
4. Slamming: Training a Speech Language Model on One GPU in a Day
5. Audio-FLAN: A Preliminary Release
6. GCC: Generative Color Constancy via Diffusing a Color Checker
7. CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
8. Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
9. Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
10. Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
11. Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
12. RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
13. Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
