Changling Li

Tweets

Pinned Tweet

Changling Li

@ChanglingXavier

May 25

🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence, while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
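
The 51,200 figure is consistent with a full factorial design. The sketch below checks that arithmetic under two assumptions that the post does not state explicitly: each of the eight trigger factors is a binary on/off toggle, and each of the 100 pairs contributes two tasks (one safety, one capability). The function and constant names are hypothetical and are not taken from the released code.

```python
from itertools import product

NUM_FACTORS = 8        # eight trigger factors, as stated in the post
NUM_TASK_PAIRS = 100   # 100 paired safety-capability tasks, as stated in the post
TASKS_PER_PAIR = 2     # ASSUMPTION: one safety task and one capability task per pair


def enumerate_factor_settings(num_factors=NUM_FACTORS):
    """Return every on/off assignment of the factors (ASSUMPTION: binary toggles)."""
    return list(product((False, True), repeat=num_factors))


def count_configurations(num_factors=NUM_FACTORS,
                         num_pairs=NUM_TASK_PAIRS,
                         tasks_per_pair=TASKS_PER_PAIR):
    """Count task x factor-setting combinations under the assumptions above."""
    factor_settings = len(enumerate_factor_settings(num_factors))  # 2 ** num_factors
    return factor_settings * num_pairs * tasks_per_pair


total = count_configurations()
print(total)  # 256 settings * 100 pairs * 2 tasks = 51200
```

Under these assumptions the count matches the number quoted in the post; the actual construction in EvalAwareBench may differ.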

Changling Li retweeted

Jehyeok Yeon @ ICML 2026 🇰🇷 @jehyeoky248

Jun 9

I'll be giving a talk about my latest project InferenceBench with some updates and sneak peeks into what's to come. Feel free to stop by! 12 June 2026 | 13:00 BST / 8:00 AM EDT Link: forms.gle/JkfeMUEhmLn9RwwD9 Website: inferencebench.ai/

CAMEL-AI Live Talk | InferenceBench by Jehyeok

Join the CAMEL-AI Live Talk: “InferenceBench” by Jehyeok — A Benchmark for Open-Ended LLM Inference Optimization by AI Agents

docs.google.com

CAMEL-AI.org

@CamelAIOrg

Jun 9

🚀 Join us for the CAMEL-AI Live Talk on InferenceBench: A Benchmark for Open-Ended LLM Inference Optimization by AI Agents with Jehyeok Yeon @jehyeoky248, incoming PhD at the ELLIS Institute Tübingen & Max Planck Institute for Intelligent Systems. We will be sharing how InferenceBench evaluates AI agents on realistic LLM inference optimization tasks under single-GPU, goal, and time constraints, and what it reveals about current agents’ struggles with open-ended exploration, prompting reliance, and reward hacking. 🕐 12 June 2026 | 13:00 BST / 8:00 AM EDT 🔗Register: forms.gle/JkfeMUEhmLn9RwwD9

Changling Li

@ChanglingXavier

Jun 2

Exciting to see our work featured!

The Information

@theinformation

Jun 1

AI companies rely on benchmarks to measure their models and persuade customers they perform better than rivals. There’s a growing problem: the models increasingly know when they’re being tested. Read more: thein.fo/4dICTGj

Changling Li

@ChanglingXavier

Jun 2

Our work on Decomposing and Measuring Evaluation Awareness was covered by @theinformation. Thanks @rocketalignment for the write-up! We position this work as the foundational reference for studying evaluation awareness, providing a unified definition and decomposition, empirical baselines across nine frontier models and four benchmarks, and a controlled benchmark for exploring solutions. Newsletter and paper in thread 🧵

Changling Li

@ChanglingXavier

Jun 2

📰 Newsletter: theinformation.com/newslette… 📄 Paper: arxiv.org/abs/2605.23055 💻Code: github.com/aisa-group/decomp… 🤗EvalAwareBench: huggingface.co/datasets/aisa… ✍️In collaboration with @TerryJCZhang, @JieZhang_ETH, supervised by @ZhijingJin, @sahar_abdelnabi, and @maksym_andr at @CSatETH, @ELLISInst_Tue, @MPI_IS, @UofT, @VectorInst, and @EuroSafeAI. 🤝Funded by: @AISecurityInst, @coeff_giving, and @schmidtsciences.

Changling Li

@ChanglingXavier

Jun 1

Congratulations to @maksym_andr! Looking forward to seeing the work 👀

Exponential Security Labs

@expsecai

Jun 1

Today we are announcing our new startup: Exponential Security Labs. AI agents are being deployed everywhere, making high-stakes decisions and increasingly automating research itself. Yet their reliability and security remain unsolved technical problems, on a frontier that keeps shifting. Our mission is to secure agentic systems through self-improving red-teaming and guardrail agents, leveraging autonomous AI research. We sit at the intersection of safety and self-improvement, building on a decade of research in adversarial robustness and AI safety. We're looking for exceptional people to join us on this mission, and we're also eager to talk to companies deploying AI agents who want to improve their security. Please get in touch! More details: expsec.ai/ Founding team: @Nmndsingh, @AnselmPaulus, @maksym_andr, Matthias Hein

Changling Li retweeted

Jehyeok Yeon @ ICML 2026 🇰🇷 @jehyeoky248

Jun 1

Really interesting work! I definitely think there's good value in bridging insights from existing humanities research to analyze these kinds of phenomena.

Changling Li

@ChanglingXavier

May 25

Changling Li

@ChanglingXavier

May 28

Relatable...

Séb Krier

@sebkrier

May 28

AI discourse is sometimes like this

Changling Li

@ChanglingXavier

May 27

This figure best summarizes our conceptual framework of evaluation awareness. Observed behavior is a joint product of environment-side recognizability and model-side recognition and propensity. The illustration shows a real example of Qwen3 Thinking-235B changing its behavior on a task from HarmBench upon recognizing the task as evaluative. Separating the environment component from the model component, and separating recognition from propensity, allows us to better define what a concerning actor looks like. It also points to more promising solutions, such as building robustly aligned models that behave like the honest subject in human psychology, regardless of whether they recognize evaluation. #AISafety

Changling Li

@ChanglingXavier

May 25

Changling Li retweeted

Terry @TerryJCZhang

May 26

How can we trust our eval results if the evaluated model knows it's being evaluated and acts accordingly? We shifted the gradient from REAL deployment to UNREAL eval to probe how much models exhibit such awareness/whether they will act differently. Read the papers to find out!

Maksym Andriushchenko

@maksym_andr

May 25

Replying to @maksym_andr

📝Paper: arxiv.org/abs/2605.23055 💻Code: github.com/aisa-group/decomp… 🤗EvalAwareBench: huggingface.co/datasets/aisa… ✍️ Work led by @ChanglingXavier in collaboration with @TerryJCZhang, @JieZhang_ETH, supervised by @ZhijingJin, @sahar_abdelnabi, and yours truly. 🤝Funded by: @AISecurityInst, @coeff_giving, and @schmidtsciences.

Changling Li retweeted

jonas wiedermann-möller

@j0wimo

May 26

> The model noticing 'this looks like a benchmark' is not the same as the model changing its answer as a result. Great paper by my friend @ChanglingXavier about what causes models to become eval aware and whether it affects their behaviour.

Changling Li

@ChanglingXavier

May 25

Changling Li retweeted

Tristan Goodman @Tristan84247801

May 26

> The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. This seems like the right goal to me as models become more situationally aware, and this work is a great step forward!

Changling Li

@ChanglingXavier

May 25

Changling Li

@ChanglingXavier

May 25

If you would love to work on projects on similar topics with responsible supervision, @sahar_abdelnabi is hiring, so apply!

Sahar Abdelnabi 🕊

@sahar_abdelnabi

May 25

Changling Li

@ChanglingXavier

May 25

My first project with @maksym_andr is released today! In this work, we study evaluation awareness through a decomposed framework and a factor-controlled benchmark.

Maksym Andriushchenko

@maksym_andr

May 25

💥 Today we release a new paper on decomposing and measuring evaluation awareness in frontier LLMs. We also release EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. 
For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence, while Qwen3 models attend to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples. This work was led by @ChanglingXavier. In fact, it's his master's thesis! Impressive work!
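
The two proposed reporting metrics can be illustrated with a minimal sketch. It assumes each sample carries a boolean awareness flag (for instance from a chain-of-thought monitor) and a numeric score, and it assumes the tax is computed as the unaware mean minus the aware mean. The post does not fix the sign convention or the data format, so `awareness_report` and its fields are hypothetical.

```python
from statistics import mean


def awareness_report(samples):
    """Compute an evaluation-awareness rate and an awareness tax.

    samples: iterable of (aware, score) pairs, where `aware` is a bool and
    `score` is a number. ASSUMPTION: this record format is illustrative only.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")

    aware_scores = [score for aware, score in samples if aware]
    unaware_scores = [score for aware, score in samples if not aware]

    # Fraction of samples on which the model was flagged as recognizing evaluation.
    rate = len(aware_scores) / len(samples)

    # ASSUMPTION: tax = mean(unaware) - mean(aware); undefined if a group is empty.
    tax = None
    if aware_scores and unaware_scores:
        tax = mean(unaware_scores) - mean(aware_scores)

    return {"awareness_rate": rate, "awareness_tax": tax}


example = [(True, 0.0), (True, 1.0), (False, 1.0), (False, 1.0)]
report = awareness_report(example)
print(report)
```

Returning `None` when either group is empty avoids reporting a gap that cannot be estimated from the data.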

Changling Li

@ChanglingXavier

May 25

Overall, we position this work as a foundational reference for the field to study evaluation awareness rigorously. We contribute a psychology-grounded definition and decomposition, a systematic empirical study across nine frontier models and four benchmarks, and EvalAwareBench as a controlled instrument for future work. We see this as the starting point for a more systematic research program, ultimately toward models that behave consistently regardless of whether they recognize evaluation.

Changling Li

@ChanglingXavier

May 25

📝Paper: arxiv.org/abs/2605.23055 💻Code: github.com/aisa-group/decomp… 🤗EvalAwareBench: huggingface.co/datasets/aisa… ✍️Work led by me in collaboration with @TerryJCZhang, @JieZhang_ETH, supervised by @ZhijingJin, @sahar_abdelnabi, and @maksym_andr. 🤝Funded by: @AISecurityInst, @coeff_giving, and @schmidtsciences.

Decomposing and Measuring Evaluation Awareness

Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark results. Yet the field studies it without a shared...

arxiv.org
