Dayananda Sagar University @DSUBangalore

Jun 13

At DSU Code Clash, students are building an AI tool that analyzes social media anomalies to forecast crises 48 to 72 hours in advance, helping NGOs like the Red Cross mobilize relief. Supercharging India’s AI future with DSU. India’s AI-First University. #DSU #CodeClash #AI

Dayananda Sagar University @DSUBangalore

Jun 12

At Code Clash, students highlighted DSU's industry focused curriculum and the landmark NVIDIA AI factory upskilling 70,000 students. Supercharging India’s AI future with DSU. India’s AI-First University. #DSU #CodeClash #AIFactory

Dayananda Sagar University @DSUBangalore

Jun 10

DSU Code Clash: Abhilash presents an app using explainable AI to stop machine failures. By analyzing real time data, it offers transparent reasoning for its predictions to ensure smarter operations. Supercharging India’s AI future with DSU. #DSU #CodeClash #AI

Dayananda Sagar University @DSUBangalore

Jun 8

At DSU's Code Clash, students highlighted teamwork in parsing data streams. Their project maps social media trends for anomaly detection, tracking engagement and language patterns to isolate unexpected spikes. Supercharging India's AI future. #DSU #CodeClash #DataScience

Kilian Lieret @KLieret

Jun 3

John is literally the goat of benchmarks, hear him talk about ProgramBench, CodeClash and SWE-Bench

vincent sunn chen

@vincentsunnchen

Jun 3

Replying to @vincentsunnchen

Kudos to the ProgramBench team! @KLieret (co-lead) @18jeffreyma @parth007_96 @dpedch @sten_sootla @micmylin @pengchengyin @magpie_rayhou @syhw @Diyi_Yang @OfirPress YouTube here: youtu.be/2SxaeuGJ0JI

John Yang

@jyangballin

May 8

Just a matter of time before we figure out how to evaluate SaaS apps effectively and put them in ProgramBench. Then once the models can one-shot smth decent (they probably can for some already), run CodeClash w/ real/simulated human votes to pick which version is better.

swyx

@swyx

May 8

Idea: Business owners should crowdsource a list of Most Hated Software and then indiehackers should pick thru and make new clones of them that are just "simple" - rewind 10 years of enshittification on them. I hate (and use): - dropbox - gusto - zoom - loom - canva - accel - most of gsuite - substack - descript - youtube

Kilian Lieret @KLieret

May 5

Replying to @noemaclips @18jeffreyma @parth007_96 @dpedch @sten_sootla @micmylin @pengchengyin @magpie_rayhou @syhw @Diyi_Yang @OfirPress @sootla_sten

We're definitely planning to also update CodeClash again! @jyangballin has also recently been working on making it easier & cheaper to evaluate

Noema @noemaclips

May 5

Replying to @jyangballin @KLieret @18jeffreyma @parth007_96 @dpedch @sten_sootla @micmylin @pengchengyin @magpie_rayhou @syhw @Diyi_Yang @OfirPress @sootla_sten

I hope you get enough support to keep the leaderboard updated from time to time, unlike CodeClash, which is one of my favs :(

Ofir Press

@OfirPress

Apr 11

codeclash dot ai, we did exactly that :)

Noam Brown

@polynoamial

Apr 10

What we really need is a benchmark where AI models make AI models that play poker.

Ofir Press

@OfirPress

Mar 17

2 important things to think of when building a new benchmark:
1. A benchmark is a collection of tasks, where each task is made up of <request, environment, stopping criteria, scorer> 4-tuples. How are you going to design each of these?
A. The request is what you want the model to actually do, i.e. in SWE-bench it would be "Fix this issue " issue_text.
B. The environment is a total description of the environment that the agent will act in while solving your request. Is internet access allowed? What dependencies are installed and which ones are not? Are there any special tools you will be providing the agent with?
C. The stopping criteria is how you decide when to end an agent's run. For some tasks the agent will probably issue a 'submit' command and exit but you need to decide how to act when that never happens. Are you going to have a turn limit per task? A cost limit? A walltime limit? A combination of these? All answers are viable, you just need to decide.
D. The scorer takes the environment as it was when the agent exited and scores it. Will you build a binary pass/fail benchmark, like we did in SWE-bench with the fail2pass and pass2pass tests? Or will you build a benchmark with a continuous score, like we did in AlgoTune, where we ask agents to speed up computer programs, and the score per task is the agent's code total runtime divided by our baseline's total runtime. Or will you use ELO like we did in CodeClash? There are many possibilities here.
2. What is the baseline scaffolding that you will use and how similar is it to the best scaffolding in common use right now? For example, if you're asking coding questions, and your scaffolding doesn't allow for code execution, that's not a very good representation of reality. If you're asking knowledge questions and don't allow access to the internet, that's not realistic. Try to make your scaffolding as good as you can. This frequently doesn't take as much effort as people think. mini-SWE-agent is able to get scores that are very competitive with (and sometimes even surpass) Claude Code these days, even though it is orders of magnitude simpler.
I talk a lot about how much easier it is to sell a benchmark that is realistic, and part of that is making the tasks realistic, but you should also make your baseline scaffolding realistic, otherwise people will mistrust your results. Building a benchmark is a lot of work but these 2 points are where I start with most projects. For more tips, see my blogpost in the reply -->
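
The <request, environment, stopping criteria, scorer> 4-tuple from the post above could be sketched in Python roughly as follows. This is only an illustration: every class, field, and function name here (Task, Environment, StoppingCriteria, binary_scorer, runtime_ratio_scorer, and the keys of final_state) is hypothetical and is not taken from SWE-bench, AlgoTune, or CodeClash.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class StoppingCriteria:
    """Limits that end an agent's run; None means no limit of that kind."""
    max_turns: Optional[int] = None
    max_cost_usd: Optional[float] = None
    max_walltime_s: Optional[float] = None

    def should_stop(self, turns: int, cost_usd: float, walltime_s: float,
                    submitted: bool = False) -> bool:
        # The agent may end the run itself with a 'submit' command.
        if submitted:
            return True
        # Otherwise stop as soon as any configured limit is reached.
        if self.max_turns is not None and turns >= self.max_turns:
            return True
        if self.max_cost_usd is not None and cost_usd >= self.max_cost_usd:
            return True
        if self.max_walltime_s is not None and walltime_s >= self.max_walltime_s:
            return True
        return False


@dataclass(frozen=True)
class Environment:
    """Assumed description of what the agent can see and use."""
    internet_access: bool = False
    installed_dependencies: tuple = ()
    tools: tuple = ()


@dataclass(frozen=True)
class Task:
    """One benchmark task: the 4-tuple described in the post."""
    request: str
    environment: Environment
    stopping: StoppingCriteria
    scorer: Callable[[dict], float]


def binary_scorer(final_state: dict) -> float:
    """Pass/fail scoring: 1.0 only if every listed test passed."""
    return 1.0 if all(final_state["test_results"].values()) else 0.0


def runtime_ratio_scorer(final_state: dict) -> float:
    """Continuous scoring: agent runtime divided by baseline runtime,
    following the wording of the post (not checked against AlgoTune itself)."""
    return final_state["agent_runtime_s"] / final_state["baseline_runtime_s"]
```

A task would then be built as, for example, Task("Fix this issue", Environment(), StoppingCriteria(max_turns=50), binary_scorer). Keeping the scorer as a plain function over the final environment state makes it easy to swap a pass/fail scorer for a continuous one without touching the rest of the task.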

Ofir Press

@OfirPress

Mar 11

Replying to @pfau

Benchmarks are an expendable/exhaustible thing. No benchmark lasts forever. When people solve CodeClash, we'll put out a new benchmark that's even more challenging than it.

Ofir Press

@OfirPress

Mar 10

CodeClash is first-authored by @jyangballin and @KLieret. It's a tough benchmark that challenges agents to write agents (yes, that's not a typo) that play in arenas against each other. This requires long-term planning, memory and creative thinking, and an ability to read logs and understand your opponent. We think there's lots more to do here, most current top LMs are pretty bad at this. Full leaderboard and details at codeclash.ai

CodeClash

CodeClash: Benchmarking Goal-Oriented Software Engineering

codeclash.ai

Ofir Press

@OfirPress

Mar 10

Full details about this experiment here: codeclash.ai/insights/202601… by @jyangballin There's so much more to do to get agents to code end-to-end objectives with no human intervention and we hope CodeClash will help guide model developers towards those goals.

Ofir Press

@OfirPress

Mar 10

Replying to @ViewToATweet @YafahEdelman

We now have some benchmarks that aren't classical "0 or 1" binary benchmarks like MMLU, HLE, SWE-bench, HumanEval,... where every question is right or wrong. In SWE-fficiency and AlgoTune the ceiling is sky high for optimization, and in CodeClash we have ELO scores. So saturation here might take much longer than in previous benchmarks. These benchmarks might have more longevity.
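
To illustrate why an Elo-style score has no fixed ceiling, here is a minimal sketch of the textbook Elo update for a single head-to-head match. It is an assumption that this resembles what CodeClash does; the post does not describe the leaderboard's actual rating procedure, and the function names and the K-factor of 32 are illustrative choices.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple:
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (assumed value) controls how far one result moves the ratings.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's change is the mirror image of A's, so the rating total is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Because a rating only has meaning relative to the other players in the pool, a stronger model entering the arena can always push the top rating higher, whereas a pass/fail benchmark stops discriminating once models approach 100%.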

Kilian Lieret @KLieret

Mar 6

In CodeClash we made agents write codebases that compete against each other for money or resources or other open-ended high level goals. Models need to find their own tasks and specs to implement. Without extremely clear specs to follow, models barely got anything done.

Ofir Press

@OfirPress

Feb 22

Another day, another reason not to use an LM as a judge. Building benchmarks is tough, and sometimes using an LM-as-a-judge looks like an easy solution to this problem, but it almost never is. Building benchmarks is about finding tough problems whose solution is easy to verify. And we've shown, in SWE-bench, SciCode, AlgoTune, SWE-fficiency, VideoGameBench, CodeClash, and CritPt that we can find extremely tough challenges that are verifiable deterministically. And we'll continue to find even tougher benchmarks, without using any type of ML model to judge correctness.

Ethan Mollick

@emollick

Feb 22

Many benchmarks use LLMs as a judge of correctness, typically a smaller, cheaper model. This paper shows weaker judges are not able to evaluate smarter models. A benchmark is really a triplet of dataset, model, judge & judges are increasingly the bottleneck being saturated.

Ofir Press

@OfirPress

Feb 5

Replying to @giansegato

Super interesting! Would love to hear your thoughts on SWE-bench, AlgoTune, SWE-fficiency, CodeClash, ... if you have any :)

Justus Mattern

@MatternJustus

Feb 3

Replying to @OfirPress

I'm a big fan of CodeClash!

Noema @noemaclips

Jan 29

Replying to @OpenHandsDev

Love it! I wish there was a spot for a 6th benchmark in there, swe-fficiency, algotune or codeclash would be great candidates

John Yang

@jyangballin

Jan 28

Replying to @18jeffreyma @a1zhang

Room for one more? @ grifters would appreciate a post about awesome benchmarks like SWE-fficiency and CodeClash coming out of Harvard.