intern @expsecai | eu/acc | msc data science | ai safety & alignment | curious about tech ml

Joined May 2019
51 Photos and videos
Pinned Tweet
My first paper is now on arXiv: Instrumental Choices. We ask a simple question: when an LLM agent can finish a real task by following the rules or by taking a useful policy-violating shortcut, which path does it choose?
4
9
52
19,054
my codex limits ran out but it still kept working for 1h20 blessed.
46
👀👀
Apr 12
Imagine the alternate reality where we named GPT-5.4-Pro something like Fable.
1
80
jonas wiedermann-möller retweeted
There you have it. Instead of Apple trying to find a suitable, compliant solution to launch Siri AI in the EU without gatekeeping competitors, Apple spent 18 months ignoring the regulations, then asked the EU Commission to be exempted from the DMA obligations.
136
67
685
255,744
agi achieved internally
Can believe this is real
1
66
will we get a new gpt-oss this year @sama
29
is this what they meant with RSI
28
when will there be a LEGO ai datacenter?
New LEGO NASA Hubble Space Telescope Set Coming In August for $129.99.
1
43
make sure you optimise you sites, personal or paper, for agents. after doing that I got increased traffic through chatgpt search for e.g
Welp, that happened faster than I predicted. Thought it would be end of 2027, then early 2027, but agentic traffic growing so fast that bots have now passed human traffic online for the first time in the Internet's history. radar.cloudflare.com/traffic…
2
60
i love cloudflare, just so many possibilities and smart features
1
51
jonas wiedermann-möller retweeted
Honored to be selected as one of 60 experts to serve on the EU AI Act Scientific Panel! The Scientific Panel will advise the EU AI Office and national authorities on the implementation of the AI Act and the assessment of the impacts and risks of General-Purpose AI models.
The AI Act, the EU's first AI law, has just been reinforced. Two new bodies will help apply the rules across Europe: ✅ Scientific Panel ✅ Advisory Forum Independent experts. 2-year terms. One mission: making AI work for Europe. 🔗 link.europa.eu/8nvpvY
4
2
74
3,286
jonas wiedermann-möller retweeted
Our work on Decomposing and Measuring Evaluation Awareness was covered by @theinformation. Thanks @rocketalignment for the write-up! We position this work as the foundational reference for studying evaluation awareness, providing a unified definition and decomposition, empirical baselines across nine frontier models and four benchmarks, and a controlled benchmark for exploring solutions. Newsletter and paper in thread 🧵
1
7
23
1,106
Very excited to be joining @expsecai for the next few months as an Intern :) I’m looking forward to working with the team on AI security for agentic systems and making agents safer to deploy in real-world settings.
Today we are announcing our new startup: Exponential Security Labs. AI agents are being deployed everywhere, making high-stakes decisions and increasingly automating research itself. Yet their reliability and security remain unsolved technical problems, on a frontier that keeps shifting. Our mission is to secure agentic systems through self-improving red-teaming and guardrail agents, leveraging autonomous AI research. We sit at the intersection of safety and self-improvement, building on a decade of research in adversarial robustness and AI safety. We're looking for exceptional people to join us! We're looking for exceptional people to join us on this mission, and we're also eager to talk to companies deploying AI agents who want to improve their security. Please get in touch! More details: expsec.ai/ Founding team: @Nmndsingh, @AnselmPaulus, @maksym_andr, Matthias Hein
10
315
can't wait to see that model in action, if we get 200-300 tps and some cheap API it could increase iteration speed significantly. potenitally you could use bigger models like gpt 5.6 or opus 4.8 for planning/review and nemotron for executing those tasks.
1
149
jonas wiedermann-möller retweeted
Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion language models that take a different approach, generating multiple tokens in parallel within a single model. Rather than committing to each token permanently, these models can revise as they go, resulting in faster inference that better utilizes modern GPUs. The full model family ranges from 3B to 14B, including vision-language variants. Available now: nvda.ws/4tEnTxP
45
193
1,215
96,339
gpt 5.5 xhigh fast in codex feels super slow today for some reason, 5 reconnects and then it takes ages to respond/work...
2
63
felt a bit silly today so i created my own handwriting font together with codex and applied it as toggle to my website. it feels much more personal and im surprised at how good it turned out
1
5
267
i dumped my template the scripts that were used and some instructions in a repo if you want to do it as well! takes ~5 minutes. github.com/imadreamerboy/han…
1
159
jonas wiedermann-möller retweeted
May 29
AI can give researchers the freedom to pursue “crazier” ideas. For Terence Tao, AI creates more room to experiment, test unexpected paths, and discover what might otherwise stay out of reach.
312
649
6,000
1,457,038
jonas wiedermann-möller retweeted
Artificial intelligences do not undergo experiences, do not possess a body, do not feel joy or pain, do not mature through relationships, and do not know from within what love, work, friendship or responsibility mean. Nor do they have a moral conscience, since they do not judge good and evil, grasp the ultimate meaning of situations, or bear responsibility for consequences. They may imitate or even simulate, but they do not understand what they produce, for they lack the affective, relational, and spiritual perspective through which human beings grow in wisdom. #MagnificaHumanitas
4,059
60,359
309,871
14,192,163
jonas wiedermann-möller retweeted
This figure best summarizes our conceptual framework of evaluation awareness. Observed behavior is a joint product of environment-side recognizability and model-side recognition and propensity. The illustration shows a real example of Qwen3 Thinking-235B changing its behavior on a task from HarmBench upon recognizing the task as evaluative. Separating the environment component from the model component, and separating recognition from propensity, allows us to better define what a concerning actor looks like. It also points to more promising solutions, such as building robustly aligned models that behave like the honest subject in human psychology, regardless of whether they recognize evaluation. #AISafety
🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation. Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem. We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring. Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones. A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone. Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
1
6
600