Geoffrey Irving

Tweets

Jacob Hilton retweeted

Geoffrey Irving

@geoffreyirving

Jun 10

We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵

Jacob Hilton @JacobHHilton

Jun 4

ARC and @aicrowdHQ are launching a ≥$100k contest for white-box estimation algorithms: given the weights of an MLP, the goal is to estimate the expected output of the network on Gaussian inputs. (Thread)

Jacob Hilton @JacobHHilton

Jun 4

It's rare to find good, easily-measurable metrics for progress in alignment. But we are cautiously optimistic that top submissions will produce ideas that meaningfully advance our research.

Jacob Hilton @JacobHHilton

Jun 4

For more discussion, see our announcement post: alignment.org/blog/announcin…

Announcing the ARC White-Box Estimation Challenge

ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later...

alignment.org

Jacob Hilton retweeted

METR

@METR_Evals

May 19

Could an AI company lose control of its own agents? To find out, Anthropic, Google, Meta, and OpenAI let us (1) test their best internal models with CoT access, (2) review non-public info about capabilities, alignment, and control. The result: our first Frontier Risk Report.

Jacob Hilton retweeted

Steven Adler

@sjgadler

May 19

Some personal news: I've started a new AI safety standards org, and our first two standards are out today. We're called Guidelight, co-founded with fellow ex-OpenAI safety researcher, Page Hedley. (1/n)

Jacob Hilton @JacobHHilton

May 7

Can you estimate the average behavior of a neural network without running it? In ARC's latest paper, we address this question for wide randomly-initialized MLPs with Gaussian inputs. (Thread)

Jacob Hilton @JacobHHilton

May 7

Our current approach works at the start of training, but we have a lot more work to do to produce methods that work throughout training (even for small models like the AlgZoo models we shared a few months ago).

Jacob Hilton @JacobHHilton

May 7

Congrats to first author Wilson Wu, as well as my other coauthors @vclecomte, Mike Winer, George Robinson and @paulfchristiano. Paper: arxiv.org/abs/2605.05179 Blog post: alignment.org/blog/mechanist…

Estimating the expected output of wide random MLPs more...

By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily...

arxiv.org
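
For readers unfamiliar with the sampling baseline that the abstract above describes, here is a minimal sketch applied to a network's output rather than a loss: draw Gaussian inputs, run the network on each, and average. This is not ARC's estimation algorithm. The function names (`mlp_forward`, `sample_mean_output`), the bias-free ReLU architecture, and the toy weights in the usage note are all assumptions made for illustration.

```python
import math
import random


def mlp_forward(weights, x):
    """Run a bias-free MLP: apply each weight matrix in turn,
    with ReLU after every layer except the last.
    (Hypothetical helper; the architecture is an assumption.)"""
    h = x
    for i, w in enumerate(weights):
        h = [sum(wij * hj for wij, hj in zip(row, h)) for row in w]
        if i < len(weights) - 1:
            h = [max(v, 0.0) for v in h]
    return h


def sample_mean_output(weights, input_dim, n_samples, rng):
    """Sampling baseline: average the network output over
    n_samples standard Gaussian inputs."""
    total = [0.0] * len(weights[-1])
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
        y = mlp_forward(weights, x)
        total = [t + yi for t, yi in zip(total, y)]
    return [t / n_samples for t in total]


if __name__ == "__main__":
    # Toy network: one hidden ReLU unit with weights (3, 4), then identity.
    rng = random.Random(0)
    weights = [[[3.0, 4.0]], [[1.0]]]
    print(sample_mean_output(weights, 2, 20000, rng))
```

For this toy network the hidden pre-activation is Gaussian with standard deviation 5, so the exact expected output is 5/sqrt(2π), roughly 1.995. The sampled estimate only approaches that value as the number of samples grows, which is the cost a white-box method would aim to avoid.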

Jacob Hilton retweeted

Dwarkesh Patel

@dwarkesh_sp

Feb 4

Seems like a great opportunity for technical talent to come into government and help the USG make sound, technically informed decisions on AI

Samuel Hammond 🦉

@hamandcheese

Feb 3

CAISI is hiring for a bunch of exciting new roles, from partnerships to technical experts in AI x bio / chem and more. They're serious about bringing in strong researchers & engineers and letting them do good work. Based in DC or SF: nist.gov/caisi/careers-caisi

Jacob Hilton @JacobHHilton

Jan 26

A challenge to the mechanistic interpretability community: fully interpret our 432-parameter RNN. (Thread)

Jacob Hilton @JacobHHilton

Jan 26

Read more on ARC's blog: alignment.org/blog/algzoo-un… Play around with the models using the AlgZoo GitHub repo: github.com/alignment-researc… Cross-postings for comments at LW/AF

AlgZoo: uninterpreted models with fewer than 1,500 parameters

This post covers work done by several researchers at, visitors to and collaborators of ARC, including Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li and Michael Sklar....

alignment.org

Jacob Hilton @JacobHHilton

Jan 26

Thanks to Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li and Michael Sklar for work on this and other 2nd argmax models.
