The HRM-Text paper is here:
sapientinc.github.io/HRM-Tex…
Just finished reading it as a deeper dive. I went in with a connected set of researcher-style questions:
Is the gain really from HRM? To answer that, we first need to separate out the objective: how much comes from computing loss only on response tokens? Then, how much comes from PrefixLM? Finally, if we remove PrefixLM, how strong is causal-only HRM?
What I appreciate is that the paper gives enough ablations to answer this chain pretty directly.
1/ First, there is real architectural signal.
-------------------------------------------------
Table 3 compares model architectures, objectives, and attention masks under the same FLOPs budget. The average scores are:
Transformer, P(x), causal 41.9
HRM, P(x), causal 51.5
Transformer, P(a|q), causal 57.6
HRM, P(a|q), causal 62.3
Transformer, P(a|q), PrefixLM 65.3
HRM, P(a|q), PrefixLM 73.4
So even under causal attention, HRM still wins over the matched Transformer.
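As a quick check on that claim, here is a minimal sketch that recomputes the HRM-minus-Transformer gap for each matched setting. The scores are the Table 3 averages quoted above; the names `table3` and `hrm_gap` are mine, not the paper's.

```python
# Average scores as quoted above from Table 3.
# Key: (architecture, objective, attention mask) -> average score.
table3 = {
    ("Transformer", "P(x)", "causal"): 41.9,
    ("HRM", "P(x)", "causal"): 51.5,
    ("Transformer", "P(a|q)", "causal"): 57.6,
    ("HRM", "P(a|q)", "causal"): 62.3,
    ("Transformer", "P(a|q)", "PrefixLM"): 65.3,
    ("HRM", "P(a|q)", "PrefixLM"): 73.4,
}


def hrm_gap(objective: str, mask: str) -> float:
    """HRM score minus the matched Transformer score, rounded to one decimal."""
    hrm = table3[("HRM", objective, mask)]
    transformer = table3[("Transformer", objective, mask)]
    return round(hrm - transformer, 1)


# One gap per (objective, mask) setting.
gaps = {
    (objective, mask): hrm_gap(objective, mask)
    for (arch, objective, mask) in table3
    if arch == "HRM"
}

for (objective, mask), gap in gaps.items():
    print(f"{objective:7s} {mask:9s} {gap:+.1f}")
```

The gap is positive in all three settings, and the causal response-only setting has the smallest one.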
2/ That said, I would read the final headline number carefully. It is not “HRM architecture alone.” It is:
-------------------------------------------------
HRM architecture
response-only objective
PrefixLM
instruction/reasoning-heavy data
2.1/ PrefixLM is a big piece. In PrefixLM, prompt tokens can attend bidirectionally, while answer tokens are still generated autoregressively.
So the prompt side becomes somewhat encoder-like, while the answer side stays decoder-style.
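To make the mask concrete, here is a minimal sketch of a PrefixLM attention mask in plain Python. This is my own illustration of the general PrefixLM idea, not code from the paper; `prefix_lm_mask`, `prefix_len`, and `total_len` are hypothetical names.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Positions 0..prefix_len-1 are the prompt; the rest are answer tokens.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            causal = j <= i                                 # usual decoder rule
            both_in_prompt = i < prefix_len and j < prefix_len  # bidirectional prompt
            row.append(causal or both_in_prompt)
        mask.append(row)
    return mask


# 3 prompt tokens followed by 2 answer tokens.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=0` this reduces to the ordinary causal mask, which is one way to see PrefixLM as a strict relaxation of causal attention on the prompt side only.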
Empirically:
Transformer:
causal response-only -> PrefixLM
57.6 -> 65.3 (+7.7)
HRM:
causal response-only -> PrefixLM
62.3 -> 73.4 (+11.1)
This is a strong improvement, but it also raised my first deployment concern.
In multi-turn chat, bidirectional prompt attention means you need special mask / KV-cache handling. You cannot simply treat it as the usual append-only causal cache in AR models. To their credit, the authors discuss this explicitly in Sec. 5.3.
2.2/ The objective also matters a lot.
> Standard LM objective: learn P(x)
> Task-completion objective: learn P(answer | question)
In practice, this means: do not spend loss predicting the prompt. Train on response tokens.
This alone moves the average:
Transformer: 41.9 -> 57.6 (+15.7)
HRM: 51.5 -> 62.3 (+10.8)
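The response-only objective described above amounts to masking prompt positions out of the loss. Here is a minimal sketch, assuming we already have a per-token negative log-likelihood for each position. The function `mean_nll` and the toy numbers are hypothetical and only illustrate the averaging, not the paper's training code.

```python
def mean_nll(token_nll: list[float], is_response: list[bool],
             response_only: bool = True) -> float:
    """Average per-token loss, optionally over response tokens only."""
    if len(token_nll) != len(is_response):
        raise ValueError("token_nll and is_response must have the same length")
    if response_only:
        # P(answer | question): prompt tokens contribute no loss.
        kept = [nll for nll, keep in zip(token_nll, is_response) if keep]
    else:
        # P(x): every token, prompt included, contributes loss.
        kept = list(token_nll)
    if not kept:
        raise ValueError("no tokens left to average over")
    return sum(kept) / len(kept)


# Toy sequence: 3 prompt tokens, then 2 response tokens (made-up losses).
token_nll = [2.0, 1.0, 3.0, 0.5, 1.5]
is_response = [False, False, False, True, True]

full_loss = mean_nll(token_nll, is_response, response_only=False)
response_loss = mean_nll(token_nll, is_response, response_only=True)
print(full_loss, response_loss)
```

In the response-only case the prompt still feeds the model as context; it just stops being a prediction target.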
3/ Then, the next sharp question for me was: What if we take causal-only HRM from Table 3 and compare it to the open models in Table 4?
-------------------------------------------------
Not the final PrefixLM HRM. Just causal-only HRM. That gives a less flashy, but more informative comparison.
Against Table 4 models, causal-only HRM roughly looks like this:
HRM causal avg: 62.3
vs Llama3.2 3B: +3.1
vs Gemma3 4B: +5.7
vs Qwen3.5 2B: +4.1
vs Huginn 3.5B: +21.2
vs Ouro 1.4B: -1.4
vs OLMo3 7B: -7.3
So causal-only HRM is still quite competitive for a 1B low-budget model (actually impressive if you look at FLOPs used compared to others in Table 4!).
4/ FLOPs fairness is important, and the paper makes a serious attempt here.
-------------------------------------------------
For the internal Transformer, TRM, and HRM comparisons, they match estimated training FLOPs, not just token count.
Because HRM spends more computation per token, the Transformer gets more training tokens under the same FLOPs budget.
The FLOPs figures are still estimates, so I read the match as approximate rather than exact.
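The trade-off behind FLOPs matching is simple arithmetic: at a fixed budget, the token count scales inversely with per-token cost. Here is a minimal sketch. All three constants are hypothetical placeholders I chose for illustration, not figures from the paper.

```python
def tokens_under_budget(flops_budget: float, flops_per_token: float) -> float:
    """Number of training tokens affordable at a fixed FLOPs budget."""
    if flops_per_token <= 0:
        raise ValueError("flops_per_token must be positive")
    return flops_budget / flops_per_token


# Hypothetical numbers, NOT from the paper.
BUDGET = 6.0e21                        # shared training FLOPs budget
TRANSFORMER_FLOPS_PER_TOKEN = 6.0e9    # cheaper per token
HRM_FLOPS_PER_TOKEN = 1.2e10           # assumed 2x the Transformer cost per token

transformer_tokens = tokens_under_budget(BUDGET, TRANSFORMER_FLOPS_PER_TOKEN)
hrm_tokens = tokens_under_budget(BUDGET, HRM_FLOPS_PER_TOKEN)

# The cheaper model sees proportionally more tokens at the same budget.
token_ratio = transformer_tokens / hrm_tokens
print(f"Transformer sees {token_ratio:.1f}x the tokens HRM sees")
```

Under these made-up numbers the Transformer trains on twice as many tokens, which is the sense in which a FLOPs-matched comparison is stricter than a token-matched one.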
-/ In summary, HRM-Text is solid work in my view. The ablations show real architectural signal, on top of recipe choices that are separate from the architecture.
That is more interesting than just a one-line architecture claim, and more useful for researchers to follow up on.
Congratulations to the team!