Oscar Balcells Obeso

Pinned Tweet

Oscar Balcells Obeso @OBalcells

9 Sep 2025

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

Oscar Balcells Obeso retweeted

Sam Bowman

@sleepinyourhat

Apr 7

(I encountered an uneasy surprise when I got an email from an instance of Mythos Preview while eating a sandwich in a park. That instance wasn't supposed to have access to the internet.)

Oscar Balcells Obeso retweeted

Leo Gao

@nabla_theta

Feb 28

Replying to @boazbaraktcs

- what happens when the model/safety stack refuses DoW queries? if the DoW gets mad and strongarms openai, like they just did to anthropic, how is openai going to resist? especially if openai doesn't even have the strong contractual protection

Oscar Balcells Obeso retweeted

Anthropic

@AnthropicAI

Feb 26

A statement from Anthropic CEO, Dario Amodei, on our discussions with the Department of War. anthropic.com/news/statement…

Statement from Dario Amodei on our discussions with the Department of War

A statement from our CEO on national security uses of AI

anthropic.com

Oscar Balcells Obeso retweeted

roon

@tszzl

Feb 4

it’s just so clear humans are the bottleneck to writing software. number of agents we can manage, information flow, state management. there will just be no centaurs soon as it is not a stable state

Oscar Balcells Obeso retweeted

Neel Nanda

@NeelNanda5

24 Dec 2025

I'll be accepting late applications to my summer MATS stream until Jan 2nd! If you want to do mech interp research supervised by me, please apply

Neel Nanda

@NeelNanda5

20 Nov 2025

My Summer MATS applications are open! You'll do full-time research on a mech interp paper supervised by me. Due Dec 23. All backgrounds welcome! I've supervised 40 papers (17 at top conferences), but projects still get better each time. I'm excited for what's next! Highlights:

Oscar Balcells Obeso retweeted

Ethan Perez

@EthanJPerez

11 Dec 2025

Fellows grads have started to get a reputation as some of the steepest trajectory researchers at Anthropic. So we’re excited to expand the program and help mentor more new AI safety researchers

Anthropic

@AnthropicAI

11 Dec 2025

We’re opening applications for the next two rounds of the Anthropic Fellows Program, beginning in May and July 2026. We provide funding, compute, and direct mentorship to researchers and engineers to work on real safety and security projects for four months.

Oscar Balcells Obeso retweeted

Leo Gao

@nabla_theta

5 Dec 2025

New post: An Ambitious Vision for Interpretability

Understanding is essential for ensuring things don't break unexpectedly. AMI is a big risky bet, but so is all ambitious research. AMI is tractable: it has good empirical feedback loops, and we've already made a lot of progress.

Oscar Balcells Obeso retweeted

Neel Nanda

@NeelNanda5

1 Dec 2025

The GDM mechanistic interpretability team has pivoted to a new approach: pragmatic interpretability.

Our post details how we now do research, why now is the time to pivot, why we expect this way to have more impact, and why we think other interp researchers should follow suit.

Oscar Balcells Obeso @OBalcells

29 Nov 2025

👀

Adam Karvonen

@a_karvonen

29 Nov 2025

Can you trust your LLM inference provider? What about your own infrastructure? Inference problems are everywhere. We introduce Token-DiFR, a simple solution. It can easily detect when inference has degraded (like bugs or hidden quantization) with no provider overhead.

Oscar Balcells Obeso retweeted

@levelsio

23 Oct 2025

🇪🇺 As a European citizen and AI founder, I can apparently use these "AI Factories", so I just signed up to use them! Every "supercomputer" has an [ ACCESS NOW ] button, which made me very excited.

I expected to sign up, maybe pay a discounted H100 rate (funded by the EU, that'd be nice?) and get a Jupyter notebook or some SSH login so I can access my GPU like I'd do on @lambdaapi or @awscloud or @Hetzner_Online.

But I celebrated too early. I signed up, confirmed my email, then ended up on a "Supercomputer Access Calls" page, where I had to select from a tedious list of "Call For Proposals" to get access to a GPU.

So I could NOT just access an H100 GPU; I have to make sure my project (in this case my business) fits a specific proposal. OK, fair.

This process was already tedious enough, but when I tried to actually go through with it, it started asking me if I had "Respect for Human Agency?" (I do, I think) and if I was mindful of "Individual, and Social and Environmental Well-Being?" (well, I am, right guys??? Right???). The questions didn't stop, just endless pages of this.

Look, I get what they're doing: they pivoted the classic university "I need to rent a giant computer for my research" into an EU-wide thing and then presented it as the "European AI plan".

But this isn't really how AI works in production? As a founder in AI, if I wanna do stuff I'd, again, rent a whole bunch of H100 GPUs at @lambdaapi or @awscloud or @Hetzner_Online and SSH into a box. Or, if I want it simpler, I run AI models on @FAL, @wavespeed or @replicate, which is just an API call or a web front end where I can click stuff and run a model.

The EU has the right intentions here, but it's just the wrong execution. This thing will 100% go nowhere, and I'm a born optimist, I want to believe, I'm also a proud European, and I'm in AI a bit and not a complete idiot.

There are just better ways to do this. If you really want to have the GPU servers in Europe (which arguably isn't that important), then let me rent a GPU box with SSH access at @Hetzner_Online or @OVHcloud that's hosted in Europe and subsidize that for European citizens and European businesses. I don't even believe in that, but at least that'd make it accessible for Europeans. Right now it really isn't.

What's REALLY much more important, though, if you want to be a part of the AI race (and I've posted about this for years here with @euaccofficial), is to make Europe an extremely attractive place to start and run an AI business. Remove regulatory obstructions and give tax discounts to startups. Let them first build a business that can compete worldwide, and once they make enough money (let's say $100M/y), then slowly start adding regulation. Because right now the regulation only benefits the European incumbents, the dinosaur companies, while making it very difficult for European citizens to start new AI companies here. Which is why we literally have none left.

Anyway, I applied to get my GPU, let's see if I get it!

@levelsio

23 Oct 2025

What in the F is an AI factory? I had to investigate what the unelected @EU_Commission is talking about today.

So according to them, it's some data centers (which they call supercomputers) in 6 different EU countries.

I checked out the most powerful one: Karolina, a Czech data center. It mostly has CPUs (see pic), not GPUs, so it's mostly useless for AI. The GPUs it does have are 72x 8x NVIDIA A100, so 576x A100, or the equivalent of 240x H100s (an H100 is about 2.4x the compute power of an A100).

So let's compare that: @xAI has 200,000x H100 GPUs, so the xAI data center has 800x more compute than the Czech one. If we combine xAI, Meta, AWS, etc., it's about 750,000 H100s.

If we assume the other 5 data centers in the EU are equivalent to the Czech one (which is a massive stretch, because most of the others seem to be AI consultancy services; they don't even HAVE chips!), the EU's new "AI factories" have a total of 1,440x H100 GPUs. Let's round up to 1,500 to be nice.

So the EU is trying to compete with 750,000 GPUs using their own 1,500 GPUs, so 500x less??

Correct me if I'm wrong, but it just seems very low impact and another ridiculous idea and burning of EU taxpayers' money that will end up with local cronies and bureaucrats and will do NOTHING to improve the AI business climate for Europe.

The best way to improve it is to deregulate and make it super easy and low tax (especially when starting out) to start AI companies in Europe.

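The back-of-envelope comparison in the tweet above can be re-derived from the figures it states. A minimal Python check, taking as given the author's own estimates (the 2.4x A100-to-H100 ratio and the assumption that all six sites match Karolina); these are the tweet's assumptions, not verified hardware data:

```python
# Figures as stated in the tweet; the 2.4x ratio and "six equal sites" are the author's assumptions
karolina_a100 = 72 * 8                        # 72 nodes x 8 A100s each
karolina_h100_equiv = karolina_a100 / 2.4     # tweet: an H100 is ~2.4x an A100

xai_h100 = 200_000
xai_vs_karolina = xai_h100 / karolina_h100_equiv

eu_h100_equiv = 6 * karolina_h100_equiv       # assumes all 6 sites match Karolina
eu_rounded = 1_500                            # the tweet rounds up "to be nice"
industry_h100 = 750_000                       # xAI + Meta + AWS etc., per the tweet
industry_vs_eu = industry_h100 / eu_rounded

print(karolina_a100, karolina_h100_equiv, round(xai_vs_karolina), eu_h100_equiv, industry_vs_eu)
```

With these inputs the xAI-to-Karolina ratio comes out to about 833x, which the tweet rounds to 800x; the other figures (576 A100s, 240 H100 equivalents, 1,440 in total, 500x) follow directly from the stated numbers.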

Oscar Balcells Obeso retweeted

Andy Arditi @andyarditi

13 Sep 2025

We found "misaligned persona" features in Llama and Qwen that mediate emergent misalignment. Fine-tuning on bad medical advice strengthens these pre-existing features, causing broader undesirable behavior. lesswrong.com/posts/NCWiR8K8…

Finding "misaligned persona" features in open-weight models — LessWrong

This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about t…

lesswrong.com

Oscar Balcells Obeso retweeted

Andy Arditi @andyarditi

10 Sep 2025

Wouldn't it be great if chat models could indicate their uncertainty as they write? Our new paper is a concrete step towards this vision, using internal representations to predict hallucination risk in real-time.

Oscar Balcells Obeso @OBalcells

9 Sep 2025

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

Oscar Balcells Obeso retweeted

Neel Nanda

@NeelNanda5

9 Sep 2025

I'm excited that, this year, interpretability finally works well enough to be practically useful in the real world! We found that, with enough effort put into dataset construction, simple linear probes are cheap, real-time, token-level hallucination detectors that beat baselines.

Oscar Balcells Obeso @OBalcells

9 Sep 2025

Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.

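Neel Nanda's tweet describes simple linear probes on a model's internal representations as cheap, real-time, token-level hallucination detectors. A minimal sketch of what such a probe computes, assuming per-token hidden-state vectors and binary hallucination labels are already available. This is not the authors' code: the synthetic 8-dimensional "hidden states", the helper names (`train_linear_probe`, `stream_flags`), and the hyperparameters are hypothetical; a real probe would be trained on activations read from a chosen model layer, using a carefully constructed labeled dataset.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_linear_probe(states, labels, lr=0.5, steps=100):
    """Hypothetical helper: logistic-regression probe p(hallucinated | h) = sigmoid(w . h + b)."""
    d, n = len(states[0]), len(states)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for h, y in zip(states, labels):
            err = sigmoid(dot(w, h) + b) - y  # d(binary cross-entropy)/d(logit)
            for j in range(d):
                grad_w[j] += err * h[j]
            grad_b += err
        # Full-batch gradient descent step on the mean loss
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def stream_flags(states, w, b, threshold=0.5):
    """Hypothetical helper: score tokens one at a time, as they would arrive during generation."""
    for i, h in enumerate(states):
        p = sigmoid(dot(w, h) + b)
        yield i, p, p > threshold

# Synthetic stand-in data: random "hidden states" labeled by a hidden linear direction
random.seed(0)
d = 8
direction = [random.gauss(0, 1) for _ in range(d)]
states = [[random.gauss(0, 1) for _ in range(d)] for _ in range(500)]
labels = [1.0 if dot(direction, h) > 0 else 0.0 for h in states]

w, b = train_linear_probe(states, labels)
results = list(stream_flags(states, w, b))
accuracy = sum(flag == (y == 1.0) for (_, _, flag), y in zip(results, labels)) / len(labels)
print(f"training accuracy: {accuracy:.3f}")
```

Because each score is a single dot product on one token's hidden state, a probe like this can run alongside generation and flag tokens as they are produced, rather than waiting for the full response.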

Oscar Balcells Obeso @OBalcells

9 Sep 2025

An unexpected finding: when we train LoRA probes with minimal regularization, models become more epistemically cautious, sometimes acknowledging they've hallucinated immediately after doing so. We only train to predict binary hallucination labels from the model's own hidden states, with no direct optimization for output behavior. Yet the model spontaneously learns to hedge its claims (e.g., adding "I cannot confirm this" right after making incorrect factual assertions).

Oscar Balcells Obeso @OBalcells

9 Sep 2025

📄 Paper: arxiv.org/abs/2509.03531
💻 Code: github.com/obalcells/halluci…
🌐 Website: hallucination-probes.com

This work was done in collaboration with @andyarditi, @javifer_96, @xjoshfree, @CameronHolmes92, and @NeelNanda5.

Real-Time Detection of Hallucinated Entities in Long-Form Generation

Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination...

arxiv.org

Oscar Balcells Obeso retweeted

Ryan Greenblatt

@RyanPGreenblatt

3 Sep 2025

I'm skeptical of claims that some specific advance will cause very above trend AI progress in the next year. Ongoing big improvements (that seem huge from the inside) are already priced into the longer running trend. In a new post, I argue this applies to better RL env quality.

Ryan Greenblatt

@RyanPGreenblatt

3 Sep 2025

x.com/i/article/196308941478…
