Matt

Tweets

Pinned Tweet

Matt

@Matthewagi

7 Nov 2025

Personalized Arxiv feed: I built a system that lets you create a personalized feed or search over recent listings. I use a two-tower architecture with a preference model and a re-ranker. uror.io
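A minimal sketch of what a two-tower retrieve-then-re-rank flow can look like. This is not the code behind uror.io: the towers here are placeholder functions over toy vectors, and `user_tower`, `item_tower`, `retrieve`, `rerank`, and the `recency` bonus are all hypothetical names and choices made for illustration.

```python
from math import sqrt

def normalize(v):
    # Scale a vector to unit length (guarding against the zero vector).
    n = sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def user_tower(liked_vectors):
    # Placeholder user tower (assumption): mean of liked paper vectors.
    dim = len(liked_vectors[0])
    mean = [sum(v[i] for v in liked_vectors) / len(liked_vectors) for i in range(dim)]
    return normalize(mean)

def item_tower(paper):
    # Placeholder item tower (assumption): a real system would encode text here.
    return normalize(paper["vec"])

def retrieve(user_vec, papers, k):
    # Stage 1: cheap dot-product scoring between the two towers' outputs.
    ranked = sorted(papers, key=lambda p: dot(user_vec, item_tower(p)), reverse=True)
    return ranked[:k]

def rerank(user_vec, candidates, recency_weight=0.1):
    # Stage 2: stand-in for a learned re-ranker; adds a small recency bonus.
    def score(p):
        return dot(user_vec, item_tower(p)) + recency_weight * p["recency"]
    return sorted(candidates, key=score, reverse=True)

papers = [
    {"id": "a", "vec": [1.0, 0.0], "recency": 0.0},
    {"id": "b", "vec": [0.9, 0.1], "recency": 1.0},
    {"id": "c", "vec": [0.0, 1.0], "recency": 1.0},
]
user_vec = user_tower([[1.0, 0.0], [1.0, 0.1]])
candidates = retrieve(user_vec, papers, k=2)
feed = rerank(user_vec, candidates)
```

The split matters because the first stage only needs precomputed item vectors, so the more expensive re-ranker runs on a short candidate list.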

Matt

@Matthewagi

4 Nov 2025

Got a nice kick in the butt yesterday to do something I've been putting off. Re-ranker for Arxiv articles is trained and ~industry standard

Matt

@Matthewagi

Jun 13

Potential soft pause which makes the new frontier price (if status quo holds)

@nrehiew_

Jun 13

So does this mean no one can ever release a model better than Fable now, lest they get banned/regulated. At the same time, you can’t really release a model and show that it loses to an existing model on all benchmarks…

Matt

@Matthewagi

Jun 13

x.com/i/status/2065651072836… But likely doesn't last long tbh

prinz

@deredleritt3r

Jun 13

Parsing this evening's events:
- The U.S. government approved the release of Fable 5 to the public, clearly under the presumption that the model's cybersecurity capabilities cannot be accessed by hackers, authoritarian regimes, etc.
- Recently (today?), "another company" showed the U.S. government that a jailbreak of Fable 5 *is possible*. Yes, a minor jailbreak - but how can a non-technical government official be assured that there aren't also other, more dangerous, jailbreaks in this model that won't be discovered by the CCP?
- Anthropic states, completely correctly, that: "We suspect that perfect jailbreak resistance is not currently possible for any model provider. Every safeguard used in the industry is vulnerable to non-universal jailbreaks (which can elicit some cyber information in specific circumstances), and it is likely that universal jailbreaks will eventually be found in the future. We stated this clearly when we released Fable 5."
- My best guess is that the U.S. government did not fully realize this at the time when the release of Fable 5 was approved.
- Per Axios, the government contacted Anthropic and asked to "pause releasing the... models but was unsuccessful" - i.e., Anthropic told the government to pound sand.
- Per Axios, this "prompt[ed] the export control letter".
- Per Axios, the U.S. government is *NOT* looking to restrict access to Fable to U.S. nationals forever. "The model needs to remain locked down until the U.S. government's national security apparatus is hardened", which "could happen in a few weeks".
- I interpret Anthropic's reaction as challenging the government: "we believe the government should have the ability to block unsafe deployments, as part of a statutory process that is transparent, fair, clear, and grounded in technical facts. This action does not adhere to those principles."

If the Axios article is correct, I do not think any other model providers have anything to fear based solely on this evening's events, because: (1) they would hopefully be smarter than downright rejecting a request by the U.S. government to pause releasing a model, and (2) they will be required anyway under the recent executive order to give the U.S. government at least 30 days to test the model for cybersecurity capabilities - during which time the U.S. government would also be able to shore up its own cybersecurity defenses with the same model.

I remain extremely concerned that actions by one particular U.S. lab over the last few months might be moving us closer and closer to the scenario where at least that lab - and potentially all others - will be nationalized.

Matt

@Matthewagi

Jun 9

I don't think it's me. Codex has been more proactive, which is bad for engineering but good for 'car guy' agentic users. I find it constantly reaching for something to do, reaching for the most hyper-engineered solution. I don't want this. There is a tension in agent systems around the idea of motion. Consumers want to feel like things are getting done. They want to be moving, and moving fast. Codex had gone too far in following directions exactly. Now it has gone too far in moving without direction but with assumptions.

Matt

@Matthewagi

Jun 8

Codex side chats are how interactivity evolves. You don't need real time if you can spin up another instance with full access to the original model's outputs. The UI and mechanisms need work, but I could see an RLM formalization appear in this direction.

Matt

@Matthewagi

Jun 7

I flew too close to the sun. My spot instance was taken

Matt retweeted

tokenbender

@tokenbender

Jun 7

We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction". A trained language model knows many things at once, but deployment usually asks for one behavior at a time. In enterprise scenarios, a few products, workflows, features, or use cases often matter disproportionately. Prism asks and answers a simple question - "Is it possible to isolate and deploy only capabilities that are driven by the Pareto principle and cut down costs by a huge margin while preserving most of the performance?" This paper discusses a novel approach to efficiency and to understanding model behavior, and opens up capability extraction.

Matt

@Matthewagi

Jun 6

Finally got my hear-s1.1 model on the MTEB leaderboard (audio only) and it ranks 12th overall! hear-s is only 22M parameters and is on the Pareto frontier in audio encoding

Matt

@Matthewagi

Jun 6

I trained it for a hackathon (which I lost), but it's an interesting example because in an open-source niche you can fairly easily get to the Pareto frontier

Matt

@Matthewagi

Jun 6

I've found a lot of signal in the finephrase work

Joël Niklaus

@joelniklaus

Jun 6

New FinePhrase result: the best synthetic-to-real ratio for pretraining isn't 50/50.

Quick context: FinePhrase is our open 486B-token synthetic pretraining dataset. We take FineWeb-Edu web text, rephrase it with a small 1.7B model (SmolLM2) into four structured formats (FAQs, math problems, tables, tutorials), and then train on a mix of original and synthetic data. The whole recipe came out of 90 controlled pretraining experiments.

The new question we tackled: how much of that mix should actually be synthetic? We swept the synthetic fraction from 10% to 90% for each format. Every format's optimum sits higher than the uniform 50/50, and it's format-dependent: tables peak at 70% synthetic, math at 80%, FAQ and tutorials at 60%. The curves climb to their peak and then plateau rather than collapsing, so there's a wide safe band and no sign of the "too much synthetic = model collapse" failure mode.

This also sets a new state of the art among synthetic pretraining data. Our best config (tables at 70% synthetic) is 31% better and reaches the same quality 3.2x faster than REWIRE, the strongest rephrasing baseline, which used a 70B-parameter model. We get there with a 1.7B rephraser that also generates tokens roughly 30x cheaper.

A caveat: these results are at small scale (1.7B parameters, 21B training tokens) so might not transfer to larger training runs.

Read the updated playbook: huggingface.co/spaces/Huggin…
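A rough sketch of what mixing at a given synthetic fraction looks like in practice. The per-format fractions are the optima quoted in the post; everything else (`mix_counts`, `build_mix`, counting documents rather than tokens, the toy document names) is an assumption for illustration and not the FinePhrase pipeline.

```python
import random

# Per-format optimum synthetic fractions as reported in the post.
BEST_SYNTHETIC_FRACTION = {"tables": 0.7, "math": 0.8, "faq": 0.6, "tutorials": 0.6}

def mix_counts(total_docs, synthetic_fraction):
    # Split a document budget into (real, synthetic) counts.
    n_synthetic = round(total_docs * synthetic_fraction)
    return total_docs - n_synthetic, n_synthetic

def build_mix(real_docs, synthetic_docs, total_docs, synthetic_fraction, seed=0):
    # Sample without replacement from each pool, then shuffle the combined mix.
    rng = random.Random(seed)
    n_real, n_synthetic = mix_counts(total_docs, synthetic_fraction)
    mix = rng.sample(real_docs, n_real) + rng.sample(synthetic_docs, n_synthetic)
    rng.shuffle(mix)
    return mix

real_docs = [f"real-{i}" for i in range(20)]
table_docs = [f"syn-table-{i}" for i in range(20)]
mix = build_mix(real_docs, table_docs, total_docs=10,
                synthetic_fraction=BEST_SYNTHETIC_FRACTION["tables"])
```

A real run would budget by tokens rather than documents; document counts are used here only to keep the example short.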

Matt

@Matthewagi

Jun 5

"GPUs of the AI, by the AI, for the AI, shall not perish from the Datacenter." Abraham Lincoln

Matt

@Matthewagi

Jun 4

Some interesting claims

Forecasting Research Institute

@Research_FRI

Jun 4

Is it possible to spot a good forecast by its rationale? We used LLMs to score the reasoning behind 55,000 forecasts and test the link between forecast accuracy and written rationales. We found that:
• Causal reasoning is much more prevalent than statistical argumentation
• It's easier to identify poor forecasters rather than excellent ones
• Human ratings of rationale quality can be unreliable.
🧵A thread on the results:

Matt

@Matthewagi

Jun 4

Nerds should buy fruit gushers to make cluster gushers

Matt

@Matthewagi

Jun 3

sometimes you need to write your prompt like a transaction with a genie

Matt

@Matthewagi

Jun 2

this is pretty neat

Mixedbread

@mixedbreadai

Jun 2

By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows. But they contain more than you think: you can extract sparse Latent Terms from them. And it turns out that BM25 is all you need to turn this vocabulary into a strong retriever.

Matt

@Matthewagi

Jun 2

I've been coming back to arxiv.org/abs/2605.22297 because it rightly points out that each layer may need its own learning rate rather than a single global rate. However, most of the reported impact might just be early embedding learning. If you smash Embed early and decay it to the FFN rate over like 20% of your training, and don't change the other layer LRs from the initial computation, then it looks like you might get the same effect with no need to recalculate LRs during training. Still, this is interesting because you can generally estimate a global LR using Hessian/critical curvature, so you should be able to estimate layerwise LRs now as well. If you're not pretraining you likely don't need to hit the embedding so hard either, which makes it very interesting to pursue a data-model method for LR determination without hyperparameter tuning.
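A minimal sketch of the schedule described above, assuming a linear decay: the embedding LR starts boosted, decays to the FFN rate over the first 20% of training, and every other group keeps its initial LR. The function names, the group keys, and the 10x starting boost are hypothetical; neither the post nor this sketch claims they match the paper.

```python
def embed_lr(step, total_steps, ffn_lr, boost=10.0, decay_frac=0.2):
    # Embedding LR: starts at boost * ffn_lr and decays linearly to ffn_lr
    # over the first decay_frac of training, then stays at ffn_lr.
    decay_steps = max(1, int(total_steps * decay_frac))
    if step >= decay_steps:
        return ffn_lr
    start = ffn_lr * boost
    progress = step / decay_steps
    return start + (ffn_lr - start) * progress

def layer_lrs(step, total_steps, base_lrs):
    # All groups keep their initial LR; only the embedding group is scheduled.
    lrs = dict(base_lrs)
    lrs["embed"] = embed_lr(step, total_steps, base_lrs["ffn"])
    return lrs

# Hypothetical initial per-group LRs, computed once before training.
base = {"embed": 1e-3, "attn": 5e-4, "ffn": 1e-3}
at_start = layer_lrs(0, 1000, base)
at_midpoint_of_decay = layer_lrs(100, 1000, base)
after_decay = layer_lrs(200, 1000, base)
```

In a training loop these values would be written into the optimizer's parameter groups each step; the point of the sketch is that nothing needs to be re-estimated during training.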

Matt

@Matthewagi

Jun 1

really interesting. if the reasoning could be done in parallel then this can likely learn it

Lukas Aichberger

@aichberger

Jun 1

We unlocked the working memory of LLMs 💥 Reasoning in Memory (RiM) replaces autoregressive "thinking out loud" with fixed memory blocks that form a task-specific workspace for latent reasoning. The key idea is simple: reasoning should happen inside the LLM, not in its output!

Matt

@Matthewagi

May 31

Matt

@Matthewagi

May 29

Why isn't ES as popular as RL? Infra appears to be a big answer. If you look at async then you can get up to like 85% inference load. With clever ES you can get up over 95%. But there are no big ES libraries, which is partly because of the social cascade of RL. They also find different signals. If you combine all those ideas, then the 10% isn't compelling enough to earn attention, and I bet it isn't widely known. So you get engineering effort focused more on RL.
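For readers who haven't seen ES (evolution strategies), here is a minimal sketch on a toy objective, using antithetic (mirrored) perturbations. It only illustrates the general technique: every candidate is scored with a plain evaluation and no backward pass, which is one plausible reason the workload looks like inference. The function name, hyperparameters, and toy fitness are assumptions, not anything from the post.

```python
import random

def evolution_strategy(fitness, theta, sigma=0.1, lr=0.05,
                       population=50, steps=200, seed=0):
    # Basic ES: estimate an ascent direction from fitness differences of
    # mirrored Gaussian perturbations; only forward evaluations are needed.
    rng = random.Random(seed)
    dim = len(theta)
    for _ in range(steps):
        grad = [0.0] * dim
        for _ in range(population):
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
            minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
            for i in range(dim):
                grad[i] += (plus - minus) * eps[i]
        # Average the estimate and take a gradient-ascent step.
        theta = [t + lr * g / (2 * sigma * population) for t, g in zip(theta, grad)]
    return theta

TARGET = [1.0, -2.0, 0.5]

def toy_fitness(x):
    # Higher is better; maximized when x equals TARGET.
    return -sum((a - b) ** 2 for a, b in zip(x, TARGET))

solution = evolution_strategy(toy_fitness, [0.0, 0.0, 0.0])
```

Each perturbation in the inner loop is independent, so in a real system the evaluations can be farmed out to workers that only ever run the model forward.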

Matt

@Matthewagi

May 27

This is where you end up in these types of games. There's a reward/effort curve and you just need to make the effort enough. The bar's not high. It's still interesting that there isn't much demand for non-slop outside of cheating yet.

Max Spero

@max_spero_

May 27

Replying to @AndyAyrey @truth_terminal

if a student fine tunes their own open source LLM, then they deserve it

Matt

@Matthewagi

May 26

"You waste years by not being able to waste hours." except it's days because I didn't check the AI's work