Research Scientist @ Apple | CMU ML

Joined August 2024
Ryan Shar retweeted
I will be presenting EDIT-Bench as an Oral at ICLR on Friday 4/23! Session 4D starts at 3:15 and the talk is at 3:39. We will also be at poster session 3 in the morning. See you all there!
19 Nov 2025
Tired of evaluating LLMs on made-up problems that look nothing like real tasks? Introducing EDIT-Bench, a code editing benchmark built from in-the-wild user interactions in VSCode. Real-world edits are challenging: only 1/40 models score > 60% pass@1.
8
31
4,444
Ryan Shar retweeted
New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵
19
27
256
26,289
Ryan Shar retweeted
19 Nov 2025
Tired of evaluating LLMs on made-up problems that look nothing like real tasks? Introducing EDIT-Bench, a code editing benchmark built from in-the-wild user interactions in VSCode. Real-world edits are challenging: only 1/40 models score > 60% pass@1.
2
12
42
15,540
Ryan Shar retweeted
I'm excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵
5
52
242
38,236
Ryan Shar retweeted
Blog post on @CopilotArena out now!
9 Apr 2025
blog.ml.cmu.edu/2025/04/09/c… How do real-world developer preferences compare to existing evaluations? A CMU and UC Berkeley team led by @iamwaynechi and @valeriechen_ created @CopilotArena to collect user preferences on in-the-wild workflows. This blog post gives an overview of the design and deployment of Copilot Arena and new insights into developer code preferences.
2
15
509
Ryan Shar retweeted
4 Mar 2025
What do developers really think of AI coding assistants? In October, we launched @CopilotArena to collect user preferences on real dev workflows. After months of live service, we're here to share our findings in our recent preprint. Here's what we have learned /🧵
16 Oct 2024
Introducing Copilot Arena - Interactive coding evaluation in the wild. Our extension lets you test top models for free, right in VSCode. Let's vote and build the Copilot leaderboard! Download here: marketplace.visualstudio.com… Led by @iamwaynechi and @valeriechen_ at CMU. 1/🧵
3
32
160
71,045
Ryan Shar retweeted
26 Feb 2025
When benchmarks talk, do LLMs listen? Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks! Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_! [1/6]
2
15
54
10,559
Ryan Shar retweeted
🧵 on surprising revelations from our study of specialized foundation models (FMs beyond vision/text): after evaluating dozens of scientific & time series FMs we found that most weren't even competitive with simple supervised models, some with as little as 513 parameters. 1/n
3
62
243
43,053
Ryan Shar retweeted
12 Nov 2024
Which model is best for coding? @CopilotArena leaderboard is out! Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes! Let's discuss our findings so far 🧵
17
77
530
136,021