At DSU Code Clash, students are building an AI tool that analyzes social media anomalies to forecast crises 48 to 72 hours in advance, helping NGOs like the Red Cross mobilize relief. Supercharging India’s AI future with DSU. India’s AI-First University. #DSU #CodeClash #AI
At Code Clash, students highlighted DSU's industry-focused curriculum and the landmark NVIDIA AI factory upskilling 70,000 students. Supercharging India’s AI future with DSU. India’s AI-First University. #DSU #CodeClash #AIFactory
DSU Code Clash: Abhilash presents an app using explainable AI to stop machine failures. By analyzing real-time data, it offers transparent reasoning for its predictions to ensure smarter operations. Supercharging India’s AI future with DSU. #DSU #CodeClash #AI
At DSU's Code Clash, students highlighted teamwork in parsing data streams. Their project maps social media trends for anomaly detection, tracking engagement and language patterns to isolate unexpected spikes. Supercharging India's AI future. #DSU #CodeClash #DataScience
John is literally the goat of benchmarks, hear him talk about ProgramBench, CodeClash and SWE-Bench
Just a matter of time before we figure out how to evaluate SaaS apps effectively and put them in ProgramBench. Then once the models can one-shot smth decent (they probably can for some already), run CodeClash w/ real/simulated human votes to pick which version is better.
May 8
Idea: Business owners should crowdsource a list of Most Hated Software and then indiehackers should pick thru and make new clones of them that are just "simple" - rewind 10 years of enshittification on them. I hate (and use): - dropbox - gusto - zoom - loom - canva - accel - most of gsuite - substack - descript - youtube
We're definitely planning to also update CodeClash again! @jyangballin has also recently been working on making it easier & cheaper to evaluate
I hope you get enough support to keep the leaderboard updated from time to time, unlike Codeclash, which is one of my favs :(
codeclash dot ai, we did exactly that :)
What we really need is a benchmark where AI models make AI models that play poker.
2 important things to think of when building a new benchmark:
1. A benchmark is a collection of tasks, where each task is a <request, environment, stopping criteria, scorer> 4-tuple. How are you going to design each of these?
A. The request is what you want the model to actually do, e.g. in SWE-bench it would be "Fix this issue " + issue_text.
B. The environment is a total description of the environment that the agent will act in while solving your request. Is internet access allowed? What dependencies are installed and which ones are not? Are there any special tools you will be providing the agent with?
C. The stopping criteria are how you decide when to end an agent's run. For some tasks the agent will probably issue a 'submit' command and exit, but you need to decide how to act when that never happens. Are you going to have a turn limit per task? A cost limit? A walltime limit? A combination of these? All answers are viable, you just need to decide.
D. The scorer takes the environment as it was when the agent exited and scores it. Will you build a binary pass/fail benchmark, like we did in SWE-bench with the fail2pass and pass2pass tests? Or will you build a benchmark with a continuous score, like we did in AlgoTune, where we ask agents to speed up computer programs and the score per task is the agent's code total runtime divided by our baseline's total runtime? Or will you use ELO like we did in CodeClash? There are many possibilities here.
2. What is the baseline scaffolding that you will use and how similar is it to the best scaffolding in common use right now? For example, if you're asking coding questions, and your scaffolding doesn't allow for code execution, that's not a very good representation of reality. If you're asking knowledge questions and don't allow access to the internet, that's not realistic. Try to make your scaffolding as close to the best available as you can. This frequently doesn't take as much effort as people think. mini-SWE-agent is able to get scores that are very competitive with (and sometimes even surpass) Claude Code these days, even though it is orders of magnitude simpler.
I talk a lot about how much easier it is to sell a benchmark that is realistic, and part of that is making the tasks realistic, but you should also make your baseline scaffolding realistic, otherwise people will mistrust your results. Building a benchmark is a lot of work but these 2 points are where I start with most projects. For more tips, see my blogpost in the reply -->
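The 4-tuple described in the post above could be sketched roughly as follows. This is only an illustration, not code from SWE-bench, AlgoTune, or CodeClash: every class, field, and dictionary key here (`Task`, `Environment`, `StoppingCriteria`, `test_results`, `agent_runtime_s`, `baseline_runtime_s`) is a hypothetical name chosen for the example, and the runtime-ratio scorer simply follows the wording of the post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Environment:
    """Hypothetical description of what the agent can see and use."""
    internet_access: bool
    installed_dependencies: Tuple[str, ...]
    tools: Tuple[str, ...]


@dataclass(frozen=True)
class StoppingCriteria:
    """Any limit left as None is treated as 'no limit'."""
    max_turns: Optional[int] = None
    max_cost_usd: Optional[float] = None
    max_walltime_s: Optional[float] = None

    def should_stop(self, turns: int, cost_usd: float,
                    walltime_s: float, submitted: bool) -> bool:
        # The run ends when the agent submits or any configured limit is hit.
        if submitted:
            return True
        if self.max_turns is not None and turns >= self.max_turns:
            return True
        if self.max_cost_usd is not None and cost_usd >= self.max_cost_usd:
            return True
        if self.max_walltime_s is not None and walltime_s >= self.max_walltime_s:
            return True
        return False


@dataclass(frozen=True)
class Task:
    """One benchmark task: <request, environment, stopping criteria, scorer>."""
    request: str
    environment: Environment
    stopping: StoppingCriteria
    scorer: Callable[[Dict[str, object]], float]


def binary_scorer(final_state: Dict[str, object]) -> float:
    # Pass/fail: 1.0 only if there is at least one test and all tests passed.
    results = final_state["test_results"]
    return 1.0 if results and all(results) else 0.0


def runtime_ratio_scorer(final_state: Dict[str, object]) -> float:
    # Continuous score as worded in the post: agent runtime / baseline runtime.
    return final_state["agent_runtime_s"] / final_state["baseline_runtime_s"]


example_task = Task(
    request="Fix this issue " + "<issue_text>",
    environment=Environment(
        internet_access=False,
        installed_dependencies=("pytest",),
        tools=("bash",),
    ),
    stopping=StoppingCriteria(max_turns=50, max_cost_usd=3.0),
    scorer=binary_scorer,
)
```

Writing the stopping criteria as explicit optional limits forces the "what if the agent never submits?" decision to be made up front, and swapping `scorer` is the only change needed to move from a binary to a continuous benchmark.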
Replying to @pfau
Benchmarks are an expendable/exhaustible thing. No benchmark lasts forever. When people solve CodeClash, we'll put out a new benchmark that's even more challenging than it.
CodeClash is first-authored by @jyangballin and @KLieret. It's a tough benchmark that asks agents to write agents (yes, that's not a typo) that play in arenas against each other. This requires long-term planning, memory and creative thinking, and an ability to read logs and understand your opponent. We think there's lots more to do here; most current top LMs are pretty bad at this. Full leaderboard and details at codeclash.ai
Full details about this experiment here: codeclash.ai/insights/202601… by @jyangballin There's so much more to do to get agents to code end-to-end objectives with no human intervention and we hope CodeClash will help guide model developers towards those goals.
We now have some benchmarks that aren't classical "0 or 1" binary benchmarks like MMLU, HLE, SWE-bench, HumanEval, ... where every question is right or wrong. In SWE-fficiency and AlgoTune the ceiling is sky high for optimization, and in CodeClash we have ELO scores. So saturation here might take much longer than in previous benchmarks. These benchmarks might have more longevity.
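To illustrate why a rating-based score has no fixed ceiling, here is a minimal sketch of the standard Elo update after one head-to-head match. The posts do not describe how CodeClash actually computes its ratings, so the K-factor of 32, the 400-point scale, and the function names below are assumptions taken from the textbook Elo formula, not from CodeClash.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    # Standard Elo expectation: probability-like score for player A vs. B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    # score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    # k=32 is an assumed K-factor, not a value reported for CodeClash.
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two equally rated agents; A wins the match.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

Because a rating only has meaning relative to the other entrants, a stronger model can always push the top of the leaderboard higher, unlike a pass/fail benchmark that tops out at 100%.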
In CodeClash we made agents write codebases that compete against each other for money or resources or other open-ended high level goals. Models need to find their own tasks and specs to implement. Without extremely clear specs to follow, models barely got anything done.
Another day, another reason not to use an LM as a judge. Building benchmarks is tough, and sometimes using an LM-as-a-judge looks like an easy solution to this problem, but it almost never is. Building benchmarks is about finding tough problems whose solution is easy to verify. And we've shown, in SWE-bench, SciCode, AlgoTune, SWE-fficiency, VideoGameBench, CodeClash, and CritPt that we can find extremely tough challenges that are verifiable deterministically. And we'll continue to find even tougher benchmarks, without using any type of ML model to judge correctness.
Many benchmarks use LLMs as a judge of correctness, typically a smaller, cheaper model. This paper shows weaker judges are not able to evaluate smarter models. A benchmark is really a triplet of dataset, model, judge & judges are increasingly the bottleneck being saturated.
Replying to @giansegato
Super interesting! Would love to hear your thoughts on SWE-bench, AlgoTune, SWE-fficiency, CodeClash, ... if you have any :)
Replying to @OfirPress
I'm a big fan of Codeclash!
Replying to @OpenHandsDev
Love it! I wish there was a spot for a 6th benchmark in there, swe-fficiency, algotune or codeclash would be great candidates
Room for one more? @ grifters would appreciate a post about awesome benchmarks like SWE-fficiency and CodeClash coming out of Harvard.