Kilian Lieret

Kilian Lieret

42 Photos and videos

Tweets

emily mcmilin retweeted

Kilian Lieret @KLieret

May 28

Very interesting study from Opus 4.8 card: Multi-agents do not deliver better results on ProgramBench, but they get to mediocre solutions 2x faster.

113

12,819

John Yang

emily mcmilin retweeted

John Yang

@jyangballin

May 5

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

104

246

1,576

728,457

Yuxiang Wei

emily mcmilin retweeted

Yuxiang Wei

@YuxiangWei9

Apr 30

Accepted to ICML 2026! Big thanks to all the collaborators 🎉

Yuxiang Wei

@YuxiangWei9

23 Dec 2025

Software agents can self-improve via self-play RL Introducing Self-play SWE-RL (SSR): training a single LLM agent to self-play between bug-injection and bug-repair, grounded in real-world repositories, no human-labeled issues or tests. 🧵

4,509

emily mcmilin

emily mcmilin @micmylin

Apr 26

I'll be giving a talk at the ICLR VerifAI workshop, about code execution for code world modeling, later today (Sun) at 9:05 am (Brazil time). Swing by if you are interested in learning more!

Ameesh Shah @ameeshsh

Jan 12

🗣️📣Announcing VerifAI 2: AI Verification in the Wild, an upcoming workshop at #ICLR2026!! 🗣️📣 VerifAI will gather researchers to explore topics at the intersection of genAI and trustworthy ML. Submit your work! Check out our website and CFP for more: verifai-workshop.github.io/

3,359

Zhiqing Sun

emily mcmilin retweeted

Zhiqing Sun

@EdwardSun0909

Apr 8

Excited to share Muse Spark, the first model from whole team’s work in MSL! 🚀 It’s natively multimodal and agentic. I’ve been using it for my daily coding and research tasks. Still plenty of room to improve in agentic domains, but we’re moving with great velocity. It’s a seriously good model! Check out the full breakdown and try it out in meta.ai

Meta AI

Use Meta AI assistant to get things done, create AI-generated images for free, and get answers to any of your questions.

meta.ai

Alexandr Wang

@alexandr_wang

Apr 8

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

203

20,309

Yuxiang Wei

emily mcmilin retweeted

Yuxiang Wei

@YuxiangWei9

23 Dec 2025

289

1,740

526,005

emily mcmilin

emily mcmilin @micmylin

3 Dec 2025

Better late than never to share how we built 35k unique repos (rather than commits from the same dozens of repos) into executable envs for CWM mid-training and SWE-RL post-training... x.com/syhw/status/1970960837…

Gabriel Synnaeve @syhw

24 Sep 2025

(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publica…

1,718

more replies

emily mcmilin

emily mcmilin @micmylin

3 Dec 2025

Key insight: the execution env of a GitHub Actions CI workflow is fully built with deps. So we can cheaply capture it as a standalone Docker image for later execution.

154

emily mcmilin

emily mcmilin @micmylin

3 Dec 2025

We modify each repo's CI workflows to capture a single successful third-party build. For pytest repos, we inject conftest.py fixtures to verify the correct container and support optional Python execution tracing. See more in our paper: arxiv.org/abs/2510.02387

136

Taco Cohen

emily mcmilin retweeted

Taco Cohen

@TacoCohen

4 Sep 2025

The eagle-eyed goat in question being @YuxiangWei9

Lucas Beyer (bl16)

@giffmana

3 Sep 2025

Goated FAIR team just found how coding agents sometimes "cheat" on SWE-Bench Verified. It's really simple. For example, Qwen3 literally greps all commit logs for the issue number of the issue it needs to fix. lol, clever model. "cheat" cuz it's more like env hacking.

5,039

emily mcmilin

emily mcmilin @micmylin

5 Dec 2024

Thank you @AleksanderMolak for the really nice opportunity to discuss some of my prior research with you, earlier this year! x.com/aleksandermolak/status…

776

emily mcmilin

emily mcmilin @micmylin

5 Dec 2024

Link to video where our part of the convo starts: youtube.com/watch?v=sljBU_HF… Botched last attempt to send this. But better late than never...

Causal Bandits at AAAI 2024 | Part 1 | CausalBanditsPodcast.com

*Causal Bandits at AAAI 2024 || Part 1**In this special episode w...

youtube.com

342

emily mcmilin

emily mcmilin @micmylin

26 Nov 2024

Dreams can come true. I’ve joined FAIR’s CodeGen team. :)

360

34,781

Udacity

emily mcmilin retweeted

Udacity

@udacity

28 Apr 2024

💡 Interested in learning more about LLM fundamentals? In the video below, Udacity instructor Emily McMilin explains what the Transformer model is & walks you through the difference between Encoder and Decoder model architectures. bit.ly/44f0eJn #genAI #generativeAI

6,410

emily mcmilin

emily mcmilin @micmylin

30 Apr 2024

Our research showing how task underspecification can cause spurious correlations & hallucinations, from BERT to GPT-3.5 is now available as AAAI 24 proceedings: ojs.aaai.org/index.php/AAAI/… Video: underline.io/lecture/92119-u… Arxiv extended to GPT-4 Turbo Preview: arxiv.org/abs/2210.00131

1,279

emily mcmilin

emily mcmilin @micmylin

27 Feb 2024

Full house at the Causal Parrots workshop at #AAAI24 llmcp.cause-lab.net/llmcp

ALT Every seat taken and audience overflowing

1,561

emily mcmilin

emily mcmilin @micmylin

25 Feb 2024

Thanks to all who stopped by my poster last night @RealAAAI. If you are interested in talking more about causality and LLMs here at #AAAI24 or beyond, please reach out!

ALT Poster presented in AAAI 2024 Main Track for paper here: https://arxiv.org/abs/2210.00131

3,655

emily mcmilin

emily mcmilin @micmylin

23 Feb 2024

Scaling up (to GPT-4 Turbo Preview) doesn’t help fix specification-induced spurious correlations. With access to GPT-4’s logprobs, we subjected it to the same methods that had found these spurious correlations in models from BERT-base to GPT-3.5. /1 x.com/micmylin/status/173362…

emily mcmilin @micmylin

9 Dec 2023

Now accepted at #AAAI24. I started single author, indie research as a ~jr SWE, recently transitioned from hardware engineering. Learned so much along the way, with help from MLC's @savvyRL & @jasonyo, Cohere4AI's @oohaijen & @sarahookr, HF's @SashaMTL and anon peers. Grateful.

835

more replies

emily mcmilin

emily mcmilin @micmylin

23 Feb 2024

Good news: As was the case with smaller models, with GPT-4 Turbo Preview, we can exploit these spurious correlations to separate well-specified from unspecified tasks, with just one extra inference pass. 3/

Task Specification Metric results from GPT-4 Turbo Preview on the Winogender- Simplified benchmark. This method exploits our finding that well-specified texts are less likely to exhibit specification-induced spurious correlations. ‘Well-specified’ texts are demarked with a blue horizontal or vertical bar. The remaining texts have a ground truth label of ‘unspecified’. Perfect detection would appear as a horizontal row of blue ‘plus’ symbols (composed of the markers from both well-specified texts) below some thresholding line, with all the green markers above.

ALT Task Specification Metric results from GPT-4 Turbo Preview on the Winogender- Simplified benchmark. This method exploits our finding that well-specified texts are less likely to exhibit specification-induced spurious correlations. ‘Well-specified’ texts are demarked with a blue horizontal or vertical bar. The remaining texts have a ground truth label of ‘unspecified’. Perfect detection would appear as a horizontal row of blue ‘plus’ symbols (composed of the markers from both well-specified texts) below some thresholding line, with all the green markers above.

271

emily mcmilin

emily mcmilin @micmylin

23 Feb 2024

Check out the updated paper, and if you're at AAAI, check out my poster, Main Track, Friday, 7-9p arxiv.org/abs/2210.00131 done/

198