tokenbender

Pinned Tweet

tokenbender

@tokenbender

Jun 7

We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction". A trained language model knows many things at once, but deployment usually asks for one behavior at a time. In enterprise scenarios, a few products, workflows, features, or use cases often matter disproportionately. Prism asks and answers a simple question - "Is it possible to isolate and deploy only the capabilities that are driven by the Pareto principle and cut down costs by a huge margin while preserving most of the performance?" This paper presents a novel approach to efficiency and to understanding model behavior, and opens up capability extraction.

tokenbender

@tokenbender

i wanted to post an update on research progress on capability extraction today but i just keep on improving it every day so yall just have to wait until i hit a short plateau.

tokenbender

@tokenbender

Jun 13

i actually do not care about Fable, it underperformed in my agentic silkroad that weaves my business, research, knowledge gathering/harvesting completely. if you try taking gpt 5.5 xhigh away, that is when I turn into the Joker.

Anthropic

@AnthropicAI

Jun 13

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

tokenbender

@tokenbender

Jun 11

tune your hparams hard. RNGesus is waiting for you in the loss valley somewhere.

Konstantin Mishchenko

@konstmish

Jun 11

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising.

tokenbender

@tokenbender

Jun 11

we do not have nearly enough bounties and hack leaderboards for the AI-human collab era we live in. we need oai parameter-golf-like problems but 100x, where every problem that gets solved leads to a snowball. too much agent lottery, not enough recognition for creativity.

tokenbender retweeted

thebes

@voooooogel

18 Dec 2024

tokenbender

@tokenbender

Jun 10

oai does not need anything from anthropic and vice versa. it is the chinese companies/emerging neolabs that are being barred from potentially distilling this next step of intelligence. you, the peasant who just wants to vibe code and sip a drink, are just collateral damage.

tokenbender

@tokenbender

Jun 10

is there anybody else who is more annoyed at how lame anthropic classifiers are than actually worried about being barred from access to some category?

tokenbender

@tokenbender

Jun 9

mythos version being 2x the price of opus means it is more like diet mythos or something.

Stephanie Palazzolo @steph_palazzolo

Jun 9

Scoop: A neutered version of Mythos called Claude Fable is coming today. It's expensive—2x the price of Opus—but perhaps not as pricey as people might have thought from the initial Mythos pricing (5x Opus). More on that and Apple WWDC in AI Agenda: theinformation.com/newslette…

tokenbender retweeted

Rohan Pandey

@khoomeik

Jun 9

x dot com in 2026 is miles ahead of slopmaxxed academic peer review culture. feels like im back in the 17th century watching newton & leibniz argue via public letters

rohan anil

@_arohan_

Jun 9

Here you go, sir. Muon is a good optimizer. I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced a much worse-looking curve than what I could get. This ended up being a nerdsnipe into hyperparameters, grafting, and eigh calls in the distributed shampoo package from Meta in my car ride home. The main deltas are below:

tokenbender

@tokenbender

Jun 9

don’t listen to doom scenarios, don’t let it sap your joy. find the juiciest problem you can within your means and wage war on it as if every wish was granted.

tokenbender

@tokenbender

Jun 9

more than half of the year has passed and everything here seems to be on-trend. everything except the last 2 points; let's see how the rest of the year goes.

tokenbender

@tokenbender

Jun 9

we need more ai architecture beefs.

Keller Jordan

@kellerjordan0

Jun 9

Thank you for this result! Here's one initial correction: In this post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`. But this is incorrect: I did not implement my own version of Shampoo. Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research. This is the most commonly-used Shampoo implementation that I could find on the internet. The extent of my use of this implementation can be seen in the few lines of code below. If there are indeed bugs in this implementation, I can safely say that they were not created by me. The remaining mystery, for anyone interested, might naturally become something like the following - How did Rohan today achieve a significantly better result compared to the official 2022-era DistributedShampoo implementation? I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.

tokenbender retweeted

sam laki @samlakig

Jun 8

why would you build something that you yourself would not use, fine-tune and perfect (whether you use AI or not). the more you use your tool, the more of your soul it imbibes, giving it a life of its own. this is a lesson i had to learn the hard way.

Boring_Business

@BoringBiz_

Jun 7

This is the tough lesson that a lot of people are learning the hard way. AI might have made building apps a lot easier, but it also set the barrier to entry at zero. Because anyone can do it, there is no moat left. The only edge left in the future will be sales and marketing.

tokenbender

@tokenbender

Jun 8

deep learning is scale and efficiency. since scale has been working, the incentive to go up has been higher than the incentive to be clever. once we cross the threshold where >90% of current human work can be met by oss models themselves (and we will), then we cut cost by 1000x or more.

roon

@tszzl

Jun 8

Replying to @MatthewJBar

I think you’re wrong and there’s 1,000x efficiency gains leftover in deep learning research that could lead to much smarter faster more agentic models given the same inputs

tokenbender retweeted

secemp

@secemp9

Jun 7

this is great btw, because it aligns with the understanding that:
- it's easy to notice that in both SFT and even during RL, gradient updates get sparser the longer the training gets, meaning we don't update the weights with exact knowledge of what is "wrong" or "missing", so we waste a lot of computation here
- we have so many OSS models already, a lot with shared knowledge or unique ones in their weights, why not just find a way beyond model merging to deeply understand the way knowledge is stored and extract that and stack it in a cheaper way inside a single model? if done properly, at one point we wouldn't even need to pretrain anymore

tokenbender

@tokenbender

Jun 7

Replying to @tokenbender

Our future work aims to extend it further to more complex use cases and behaviors and create a new axis for model efficiency that can be stacked over existing methods with minimal loss. Code - github.com/e-xperiments/pris… Paper - github.com/e-xperiments/pris…

tokenbender retweeted

krishna

@OccupyingM

Jun 7

have been working on this with @tokenbender for the past two months. PLEASE CHECK IT OUT. it's so fucking cool, you can literally extract capabilities and run them on a fraction of the model. will write a blog on everything i learnt working with token and on this challenging problem soon

tokenbender

@tokenbender

Jun 7

tokenbender

@tokenbender

Jun 7

In the one month since then, we cracked how to isolate function-calling capabilities at 8B scale. On Qwen3-8B BFCL, raw recovery at the 36.2 percent substrate was 19.1 percent. With the right post-attribution objective, recovery rose to 84.6 percent at the same channel budget.
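
For scale, the two recovery figures quoted in this post can be compared directly. The following is a minimal sketch that uses only the numbers stated above; the variable names are illustrative, and it makes no assumption about how the paper measures "recovery" beyond treating both figures as percentages on the same scale.

```python
# Recovery figures quoted in the post (Qwen3-8B on BFCL, 36.2 percent substrate).
# Variable names are hypothetical; only the two numbers come from the post.
raw_recovery = 19.1      # percent, before the post-attribution objective
tuned_recovery = 84.6    # percent, with the post-attribution objective

# Absolute gain in percentage points, and the tuned figure as a multiple of the raw one.
gain_points = round(tuned_recovery - raw_recovery, 1)
gain_factor = round(tuned_recovery / raw_recovery, 2)

print(f"gain: {gain_points} points, {gain_factor}x the raw recovery")
```

At the same channel budget, that is a gain of 65.5 percentage points, or roughly 4.4 times the raw recovery.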

tokenbender

@tokenbender

Jun 7

GitHub - e-xperiments/prism-capability-extraction

github.com
