Thx and to be real, I see this type of work very differently because of where it’s coming from. If approximately the same paper came from almost anywhere else, I would have comments about 1) expanding the related works section, 2) making the actual mechanics of the method clear in the abstract, and 3) talking about how SAEs are not a unique way of learning about your datasets and how they affect your model, and that a limitation of the paper is that it did not pit the method against an unsexy equal-effort control involving manual or automated dataset exploration to find issues. But that would be it.
But the paper is from GoodFire, and on one hand, I hold it to a different standard because of the people and money it has. On the other hand, I’m against what GoodFire has become as a VC-backed, for-profit company with a product to sell, using GIFs and McGrathian marketing to convince non-scientists that GF is doing things that are much more impressive than they are. This often includes safetywashing. Back in 2023 I read Eric Ho’s white paper for GF, got the ick, and told him that I thought that the absolute last thing that the epistemics of the interpretability community needed was a big company that sucks up a bunch of researchers to spin, market, and sell their work for profit. And not to sound a certain way, but Eric then proceeded to do exactly what I had worried about.
I like your work. I just don’t like where you work. It’s worth taking a second to acknowledge how much it sucks that GoodFire can raise over 1 billion while academic labs like Lakkaraju’s, Bau’s, Geva’s, Tegmark’s etc. do less conflicted and typically better work, far more efficiently, with far less recognition and far less money. The difference isn’t research quality. It’s the rich and shameless big tech and venture capital stuff. Unfortunately, I think it’s clear that GoodFire’s leadership has adapted by trading away epistemic responsibility in order to exploit that nonsense, miseducating its audience along the way.
(And don’t get me wrong, I have related thoughts on GDM, OAI, and Anth.)
IDK, how do you feel about being one of the authors on a paper in which Figure 4 seemed to be an advertisement for a venture-capital-backed tech product?