CS PhD Candidate @Yale | Research Intern @Google

Gabrielle Kaili-May Liu retweeted
When multimodal AI meets real-world expertise, reasoning gets harder, deeper, and much more exciting. Join us at KnowledgeMR @ #CVPR2026 to push this frontier forward! 🗓️ Thu June 4 | 8am | Room 704/706 Speakers: @thoma_gu @huang_biwei @pliang279 @MengdiWang10 @xwang_lk
Also glad to share this was accepted as a ✨spotlight✨ to the Trustworthy AI4GOOD Workshop @ #ICML 2026!
🔥 Excited to share our paper: Quantifying Faithful Confidence Expression in Large Reasoning Models (LRMs)! 🔥 We trust reasoning models partly because they show their work. But do their words reflect how confident they really are? 🤔 Check our preprint to find out! Details 🧵👇
⚠️ Key Finding #6: Different intrinsic confidence estimators produce 🔀 divergent faithfulness profiles on identical CoT traces. This reveals fragility in prior evaluation methods & suggests LRM uncertainty signals do not map neatly to linguistic expression.
(13/n) If you're interested in LLM reasoning, uncertainty, or faithfulness, check out our paper and analysis framework! We'd love feedback or questions 🙏 📄 Paper: arxiv.org/pdf/2606.03969 🔗 Github: github.com/yale-nlp/faithful…

(14/n) Thanks to my co-firsts @areebg9 and Asal Meskin, and @armancohan's advising!
(3/n) Yet studying FC in LRMs is uniquely hard 🔍. Long CoT traces lack clean step boundaries, exhibit inconsistent step structure, and encode complex conditional dependencies that evolve throughout the trace, making existing FC evaluation methods ill-suited to this setting.
⚠️ Key Finding #5: Trajectory-level faithfulness dynamics 🌊 vary with model and estimator. Expressed confidence of later reasoning steps is 🚫 not uniformly more faithful than earlier ones, despite being more calibrated with accuracy.
⚠️ Key Finding #4: Prompt interventions that boost FC in standard LLMs fail to generalize to LRMs 📉. Even metacognitive prompting, shown in prior work to robustly improve faithful calibration of non-reasoning models, yields minimal gains in the reasoning setting.
⚠️ Key Finding #3: Distillation differentially reshapes & distorts FC vs. reasoning training in ways that cannot be inferred from architecture, scale, or accuracy alone: 🎭 distilled models should not be treated as FC proxies for their teachers!
⚠️ Key Finding #2: ⚙️ Reasoning training 📉 degrades FC. Comparing matched reasoning & non-reasoning checkpoints of the same model backbone, reasoning-tuned variants produce more hesitation, but surface-level caution does not correspond to lower internal confidence.
⚠️ Key Finding #1: Reasoning behaviors do not automatically translate to improved faithfulness of uncertainty expression. LRMs remain highly decisive even when frequently wrong 😬, and model size provides limited assistance to LRMs, ❌ in contrast to FC of LLMs.
(6/n) We apply our framework across 🤖 7 models, 🧩 5 diverse reasoning-intensive datasets (math, science, law, multi-step soft reasoning), and various 🧪 prompt interventions, finding that faithful confidence expression remains a significant challenge for LRMs 😔.
(5/n) This gives a 🔭 multi-dimensional view of faithfulness throughout a CoT trace. We also introduce a 💡 prefix-conditioned sampling approach to control for conditional dependencies and structure across sampled traces, a key challenge that existing methods overlook.
(4/n) To address this, we present a novel framework to systematically quantify FC in LRMs 🎯. Our framework analyzes linguistic decisiveness against 3️⃣ complementary sources of internal confidence, derived from 🕵 hidden states, ⚙️ token probabilities, & sampling consistency ⚖️.
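As a rough illustration of the comparison described in (4/n), the sketch below computes two simple stand-ins for internal confidence and measures how far an expressed decisiveness score sits from each. This is not the paper's implementation: the function names, the majority-agreement estimator, the geometric-mean token probability, and the absolute-gap measure are all assumptions made for illustration, and the hidden-state estimator is left out because the thread gives no detail on it.

```python
import math
from collections import Counter


def sampling_consistency(answers):
    """Hypothetical estimator: share of sampled answers matching the most common one."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def token_prob_confidence(token_logprobs):
    """Hypothetical estimator: geometric mean of token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def faithfulness_gap(decisiveness, confidence):
    """Assumed measure: absolute gap between expressed and internal confidence in [0, 1]."""
    return abs(decisiveness - confidence)


# Toy inputs, not data from the paper.
sampled_answers = ["42", "42", "17", "42"]
consistency = sampling_consistency(sampled_answers)
print(faithfulness_gap(0.9, consistency))
```

With these toy inputs, three of the four sampled answers agree, so the consistency estimate is 0.75. A decisiveness score of 0.9 would then overstate internal confidence under this assumed measure.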
(2/n) Faithful calibration (FC), the alignment between models' intrinsic & expressed uncertainty, is a persistent failure mode for LLMs 😔. This is especially consequential for LRMs, whose reasoning traces are seen as concrete signals of competence & confidence.
🔥 Excited to share my new preprint: Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence? 🔥 When an LLM says "I think" or "probably," does it actually mean something consistent internally? The answer: not really 😬 Check out details in 🧵 (1/n):
(15/n) If you're interested in LLM uncertainty, reliability, or human-AI interaction, please check out our paper and analysis framework! We'd love feedback or questions :)) 📄 Paper: arxiv.org/pdf/2605.28778 🔗 Github: github.com/yale-nlp/marker_i…

(16/n) Shoutout to my co-author @armancohan!