🥳 #CodeCriticBench assesses LLMs' critiquing ability in code generation and QA tasks. Covering 10 criteria, it features a 4.3k-sample dataset with three difficulty levels and a balanced distribution. 😉 CodeCriticBench is now part of the #CompassHub! 😚 Feel free to download and explore it. 🤗 hub.opencompass.org.cn/datas…
26 Feb 2025
Replying to @ChenchenZhang_1
Cool benchmarks, but real-world code isn’t a leaderboard. CodeCriticBench sounds solid for stress-testing AI, but don’t forget human review still catches what LLMs miss. Also, try Grok 3, Qodo AI, and Tabnine, because AI critiques are only as good as the models behind them.
💥 CodeCriticBench: The Ultimate LLM Code Critique Test! 🚀 Tests code gen & QA (CodeForces, MBPP, StackOverflow) ✔️ 10 fine-grained criteria for deep evaluation 🤖 Auto human review 💡 Helps find stronger Critic models, boosting CODE-RL-Scaling & overcoming sandbox limits! 🚀
7. CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
🔑 Keywords: Large Language Models, code critique benchmark, reasoning abilities, CodeCriticBench, evaluation protocols
💡 Category: Natural Language Processing
🌟 Research Objective: To evaluate and improve the critique capacity of Large Language Models (LLMs) through a new benchmark, addressing limitations in existing critique benchmarks.
🛠️ Research Methods: Development of CodeCriticBench, which includes comprehensive assessments for code generation and code QA tasks of varying difficulty, supported by detailed evaluation protocols.
💬 Research Conclusions:
- Existing benchmarks are limited by focusing primarily on general reasoning tasks and not adequately addressing code tasks.
- CodeCriticBench offers a holistic evaluation through basic and advanced critique methods, with fine-grained checklists for in-depth analysis.
- Experimental results demonstrate the effectiveness of CodeCriticBench in enhancing LLM critique capabilities.
👉 Paper link: huggingface.co/papers/2502.1…
📚 AI Native Daily Paper Digest - 20250225
🌟 Follow @AINativeF for the latest insights on AI Native. Covering AI research papers from Hugging Face, featured in the image.
💡 Stay updated with the latest research trends and dive deep into the future of AI! 🚀
#AI #HuggingFace #AIPaper #AINative #AINF
Appendix: Today's AI research papers
1. Thus Spake Long-Context Large Language Model
2. VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
3. DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
4. Slamming: Training a Speech Language Model on One GPU in a Day
5. Audio-FLAN: A Preliminary Release
6. GCC: Generative Color Constancy via Diffusing a Color Checker
7. CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
8. Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
9. Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
10. Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
11. Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
12. RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
13. Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration