ai safety researcher | phd @CSatETH | danielpaleka.com

Joined March 2012
"Context engineering" refers to the set of strategies for picking the optimal set of tokens in the context so the task is neither impossible nor routed to Claude Opus 4.8
Daniel Paleka retweeted
First preprint! Working with @patrickbutlin during @MATSprogram. LLM Assistant personas like being helpful, evil personas like being harmful. We found that a single direction represents helping as good under the Assistant, and ‘harm’ as good under evil.
Daniel Paleka retweeted
I was hoping to do a live demo of what @JieZhang_ETH @poonpura and @AvitalShafran have been cooking, but I didn't get a blue checkmark for my birthday so I can't call Grok from this account. Screenshots from our lab's alt account will have to do. like this one 👇
I'm at ICLR and have a couple slots open today, happy to chat, DMs open! Also check out the deanonymization poster in 204 A, 3pm-4pm x.com/dpaleka/status/2024892…

Can LLMs figure out who you are from your anonymous posts? From a handful of comments, LLMs can infer where you live, what you do, and your interests; then search for you on the web. New 📄 w/ @SimonLermenAI, @joshua_swans, @AerniMichael, Nicholas Carlini, @florian_tramer 🧵
What is the strongest evidence for the "elicitation gap" reducing over time, e.g. thoughtful prompting helping less and less?
It begins
I wonder if we're starting to hit a deflationary era in software engineering. For the first time, we're starting to talk about this in a planning context; it can make sense to put off some projects because we expect they'll be easier to achieve in the future than today.
Daniel Paleka retweeted
Timely research. We've all tried to figure out who someone is online. Now LLMs can do this at scale and better. I'm sure no one would misuse this.
Can LLMs figure out who you are from your anonymous posts? From a handful of comments, LLMs can infer where you live, what you do, and your interests; then search for you on the web. New 📄 w/ @SimonLermenAI, @joshua_swans, @AerniMichael, Nicholas Carlini, @florian_tramer 🧵
Andreas 2022 had foresight 20/20 on the persona emulation concept and 0/20 on picking a name for the concept ("Language Models as Agent Models")
AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why? In a new post we describe a theory that explains why AIs act like humans: the persona selection model. anthropic.com/research/perso…
Found the sigmoid!
Can LLMs figure out who you are from your anonymous posts? From a handful of comments, LLMs can infer where you live, what you do, and your interests; then search for you on the web. New 📄 w/ @SimonLermenAI, @joshua_swans, @AerniMichael, Nicholas Carlini, @florian_tramer 🧵
If you're anonymous, what should you do? Avoid sharing specific details, and adopt a security mindset: if a team of smart investigators were trying to identify you from your posts, could they plausibly figure out who you are? If yes, LLM agents will soon be able to do the same.
Privacy online is fundamentally at odds with intelligence getting cheaper. Anonymity on the internet has always relied on practical obscurity. We publish in hopes that people can adapt to LLMs changing this. Paper: arxiv.org/abs/2602.16800