Rob Wiblin

Jaechul Roh retweeted

@robertwiblin

Jun 2

My best interview in some time. Rohin Shah leads AGI alignment/safety at DeepMind. And he has a lot of spicy personal takes:
We probably won’t get catastrophic misalignment (00:49)
Safety 'commitments' have severe limitations (10:38)
The intelligence explosion probably isn't imminent (1:52:44)
Why he's not working to pause AI advances (51:44)
Pre-deployment evals aren't the right focus (for catastrophic risks) (37:41)
Signalling concern for safety sometimes diverts resources from actually making AI safe (01:09:51)
Reading AI thoughts is v useful for safety – and we'll probably be able to for years to come (54:17)
Governance is somewhat more likely to be the bottleneck than alignment (43:55)
Rohin's team doesn't have a veto, and that's OK (27:36)
Central banks are a promising model for regulating AI (33:34)
Also:
Google DeepMind's actual plan for building AGI safely (1:40:29)
How external researchers can positively influence big AI companies (2:21:55)
The roles GDM most needs to hire for (2:37:03)
On the 80,000 Hours Podcast. Links below - enjoy! (@rohinmshah)

2:48:27

Jaechul Roh

@JaechulRoh

May 21

New preprint: Codec-Robust Attacks on Audio LLMs #CodecAttack Lossy codecs (Opus, MP3, AAC) have been treated as a defense against adversarial audio. We show they're actually an attack surface.

Jaechul Roh

@JaechulRoh

May 21

Why does it survive? The latent perturbation concentrates 88% of energy below 4 kHz, exactly where codecs allocate the most bits. A Jacobian analysis confirms this is structural: the decoder has no basis functions above 4 kHz.
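The "88% of energy below 4 kHz" figure is a band-energy ratio. A minimal sketch of how such a ratio can be measured, assuming a plain DFT over a short signal; the function name `energy_fraction_below` and the toy two-tone signal are hypothetical illustrations, not the paper's code or data:

```python
import cmath
import math


def energy_fraction_below(signal, sample_rate, cutoff_hz):
    """Return the fraction of spectral energy in DFT bins below cutoff_hz."""
    n = len(signal)
    total = 0.0
    low = 0.0
    # One-sided spectrum: bins 0..n//2 cover 0 Hz up to the Nyquist frequency.
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n < cutoff_hz:
            low += power
    return low / total if total else 0.0


# Toy stand-in for a perturbation (hypothetical): a 1 kHz tone plus a
# weaker 6 kHz tone, sampled at 16 kHz so both fall on exact DFT bins.
sample_rate = 16000
n = 160
toy = [
    math.sin(2 * math.pi * 1000 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 6000 * t / sample_rate)
    for t in range(n)
]
fraction = energy_fraction_below(toy, sample_rate, 4000)
print(f"energy below 4 kHz: {fraction:.2%}")
```

For real audio, an FFT library would replace the quadratic-time loop; the ratio itself is computed the same way.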

Jaechul Roh

@JaechulRoh

May 21

Joint work with @Qualcomm. Huge thanks to @JeanMonteuuis, Jonathan Petit, and @houmansadr for an amazing collaboration! Paper: arxiv.org/abs/2605.20519

Codec-Robust Attacks on Audio LLMs

Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against...

arxiv.org

Jaechul Roh

@JaechulRoh

May 11

We still listen to old songs not because they are the best recordings, but because they remind us of something. A place, a person, a feeling. There is usually something imperfect about them, and I think that imperfection is part of why they stay with us. My daily research is in AI security, but I have also been interested in a different kind of threat lately. Not a technical one, but a cultural one. I keep asking myself: what happens when more of the music, art, and stories around us are AI-generated? Not whether they will be good or bad, but whether they will carry the same weight over time. My recent blog post explores that question through the lens of why imperfection matters, how it connects to memory, and what we might quietly lose if it disappears. It is a highly opinionated piece of writing, not a research paper. Just a casual read. But it has been on my mind for a while and I wanted to share.

Jaechul Roh

@JaechulRoh

May 11

📰 aisec.cs.umass.edu/blog/post…

Neel Nanda

Jaechul Roh retweeted

@NeelNanda5

12 May 2025

After supervising 20 papers, I have highly opinionated views on writing great ML papers. When I entered the field I found this all frustratingly opaque. So I wrote a guide on turning research into high-quality papers with scientific integrity! Hopefully still useful for NeurIPS

Jaechul Roh

@JaechulRoh

Apr 21

1/ Fine-tuning an Audio LLM on a benign audio dataset pushed its jailbreak rate from 4.62% → 87.12%. No adversary. No harmful data. New paper 🧵

Jaechul Roh

@JaechulRoh

Apr 21

7/ Good news: two simple defenses bring JSR back to near-zero.
🛡️ Distant filtering (training time): pick benign samples farthest from harmful embeddings
🛡️ System prompt (inference time): just tell the model to refuse
Safety is fragile, but recoverable.
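A rough sketch of what "pick benign samples farthest from harmful embeddings" could look like. Everything here is an assumption for illustration: the function `distant_filter`, the use of cosine distance to the centroid of the harmful embeddings, and the toy 2-D vectors. The paper may define distance and selection differently.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distant_filter(benign, harmful, keep):
    """Return indices of the `keep` benign embeddings farthest from the
    centroid of the harmful embeddings (assumed selection rule)."""
    dim = len(harmful[0])
    centroid = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    ranked = sorted(
        range(len(benign)),
        key=lambda i: cosine_distance(benign[i], centroid),
        reverse=True,
    )
    return sorted(ranked[:keep])


# Hypothetical 2-D embeddings; real ones would come from the model's encoder.
harmful = [[1.0, 0.0], [0.9, 0.1]]
benign = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
selected = distant_filter(benign, harmful, keep=2)
print(selected)
```

The selected indices would then define the fine-tuning subset; samples close to the harmful region are simply left out of training.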

Jaechul Roh

@JaechulRoh

Apr 21

Work done with @houmansadr 📄 Paper: arxiv.org/abs/2604.16659

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples...

arxiv.org

Jaechul Roh

@JaechulRoh

Mar 5

Excited to have contributed to this work during my internship at Brave. Turns out making AI agents more private also makes them more useful, up to 17.9% better task success. Paper: arxiv.org/pdf/2602.13516

Brave

@brave

Mar 5

AI agents that browse for us can perform a lot of tasks on our behalf, from booking reservations to filling out forms. Unfortunately, these agents have a serious privacy issue: oversharing users' personal information. Fixing this problem is key to making AI more effective.

Akshay 🚀

Jaechul Roh retweeted

@akshay_pachaar

Mar 4

x.com/i/article/202919735942…
