Yuling Gu

Tweets

Pinned Tweet

Yuling Gu @gu_yuling

Mar 6

🎉 SimpleToM has been accepted to #ICLR2026! LLMs can tell you what someone knows (explicit ToM). But when asked to apply it to predict behavior or judge actions (applied ToM), even frontier LLMs still fail. 🤯 The gap between knowing and applying is real… and huge. 👀 1/

Yuling Gu @gu_yuling

Apr 22

Check out SimpleToM at #ICLR2026 where we reveal a critical fragility in LLMs’ social reasoning — the explicit vs. applied ToM gap. 🗓️Fri, Apr 24, 2026 3:15 PM – 5:45 PM BRT 📍Pavilion 3 P3-#1407

Yuling Gu @gu_yuling

Apr 22

I can’t be there in person due to visa issues, but please meet my amazing co-authors! My DM (and email) are open if you’d like to connect!

Yuling Gu @gu_yuling

Mar 6

SimpleToM exposes this gap 🔎 and provides a benchmark to diagnose, improve, and push LLMs toward robust social reasoning 🚀 Try SimpleToM on any model: huggingface.co/datasets/alle… 5/

allenai/SimpleToM · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Yuling Gu @gu_yuling

Mar 6

Work done during my time at @allen_ai with wonderful collaborators Oyvind Tafjord, @hyunw_kim, @jaredlcm, @Ronan_LeBras, Peter Clark, @YejinChoinka. 📜 Paper: arxiv.org/abs/2410.13648 💻 Code: github.com/yulinggu-cs/Simpl… 6/

SimpleToM: Exposing the Gap between Explicit ToM Inference and...

Large language models (LLMs) are increasingly tested for a "Theory of Mind" (ToM) - the ability to attribute mental states to oneself and others. Yet most evaluations stop at explicit belief...

arxiv.org

Yuling Gu retweeted

Kyunghyun Cho

@kchonyc

9 Dec 2025

i gave a keynote talk at NeurIPS'25 just last week. here's the slide deck (link below) i've used to share my thoughts on who we are and what we do.

Yuling Gu @gu_yuling

20 Nov 2025

Super proud of the amazing work that my Ai2 friends have been doing! 🤩 Check this out! ✨

Ai2

@allen_ai

20 Nov 2025

Announcing Olmo 3, a leading fully open LM suite built for reasoning, chat, & tool use, and an open model flow—not just the final weights, but the entire training journey. Best fully open 32B reasoning model & best 32B base model. 🧵

Yuling Gu retweeted

Danica Dillion

@danicajdillion

19 Sep 2025

🌍 Introducing WorldValuesBench! A benchmark to evaluate how well LLMs reflect cultural differences in human values. Built from 94k participants in the World Values Survey → 20M examples of (demographics, value question → answer). 🧵

Yuling Gu retweeted

David Heineman @davidheinnman

19 Aug 2025

Evaluating language models is tricky: how do we know if our results are real or due to random chance? We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵

Ai2

@allen_ai

19 Aug 2025

📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵

Yuling Gu @gu_yuling

30 Apr 2025

Excited to be at #NAACL2025 in Albuquerque this week! I'll be presenting "OLMES: A Standard for Language Model Evaluations" (arxiv.org/abs/2406.08446)! Work done with my wonderful collaborators at @allen_ai ❤️

OLMES: A Standard for Language Model Evaluations

Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models can be particularly challenging, as choices of...

arxiv.org

Yuling Gu @gu_yuling

30 Apr 2025

This effort toward an open language model evaluation standard doesn’t just end here. Since the submission of our NAACL paper, we have added more tasks to OLMES, including generative and reasoning tasks, all openly available in our repository (github.com/allenai/olmes).

GitHub - allenai/olmes: Reproducible, flexible LLM evaluations

github.com

Yuling Gu @gu_yuling

30 Apr 2025

Come to our poster session on Friday, May 2, 9:00-10:30 am (Hall 3) to chat more!

Yuling Gu retweeted

Ai2

@allen_ai

31 Mar 2025

Imagine AI doing science: reading papers, generating ideas, designing and running experiments, analyzing results… How many more discoveries can we reveal? 🧐 Meet CodeScientist, a promising next step toward autonomous scientific discovery. 🧵

Yuling Gu retweeted

Kyle Lo

@kylelostat

3 Jan 2025

kicking off 2025 with our OLMo 2 tech report while payin homage to the sequelest of sequels 🫡 🚗 2 OLMo 2 Furious 🔥 is everythin we learned since OLMo 1, with deep dives into: 🚖 stable pretrain 🚔 lr anneal 🤝 data curricula 🤝 soups 🚘 tulu post-train 🚜 compute infra 👇🧵

[Image: first page of tech report titled "2 OLMo 2 Furious"]

Yuling Gu retweeted

Ai2

@allen_ai

21 Nov 2024

Meet Tülu 3 -- a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms. We invented new methods for fine-tuning language models with RL and built upon best practices in the community to scale synthetic instruction and preference data. Demo, GitHub, technical report, and models below 👇

Yuling Gu @gu_yuling

25 Oct 2024

⚠️ Introducing SimpleToM, exposing a jarring gap in the Theory-of-Mind capabilities of current frontier LLMs: 😲 They fail to implicitly apply mental state inferences, even when they can easily infer these states for two-sentence stories. 😲 📜 arxiv.org/abs/2410.13648 1/

Yuling Gu @gu_yuling

25 Oct 2024

“Knowledge isn’t power until it’s applied” -- Dale Carnegie 💻 We make SimpleToM publicly available: huggingface.co/datasets/alle…. 📜 For more details, we invite you to check out our paper (arxiv.org/abs/2410.13648)! ✍️ And blog post (blog.allenai.org/d32dd28d83d…)! 10/

Yuling Gu @gu_yuling

25 Oct 2024

Work done at @allen_ai with my wonderful collaborators: Oyvind Tafjord, @hyunw_kim, @jaredlcm, @Ronan_LeBras, Peter Clark, @YejinChoinka from @allen_ai @uwnlp @stanfordnlp 11/11
