Joined August 2025
44 Photos and videos
Pinned Tweet
Today we’re releasing Ramp SWE-Bench: a private, production-grounded coding benchmark created from real engineering problems we've faced at Ramp.
34
47
910
161,831
Today we’re releasing Ramp SWE-Bench: a private, production-grounded coding benchmark created from real engineering problems we've faced at Ramp.
34
47
910
161,831
Public benchmarks saturate quickly and inevitably leak into training data, with none quite resembling the work our engineers do every day. Building our own benchmark has allowed us to evaluate models within our own financial software ecosystem. We compared models side by side and unearthed their behavioral differences. Head to head breakdowns available here: labs.ramp.com/swebench
2
2
67
9,246
When measuring effectiveness versus cost, the frontier presents as a tradeoff rather than a single winner. Read our methodology and explore the results below: labs.ramp.com/swebench
6
6
105
54,665
We deployed 10,000 background agents to security-scan our codebase. The system is simple, scales with compute, and runs on publicly available models. From the scan, we fixed several high-severity vulnerabilities.
22
21
459
70,671
The scan pipeline is model-agnostic, and does not require a frontier model to drive it. We evaluated several models against our confirmed vulnerabilities, and found that cheaper open-weight models still surface high-severity issues.
3
3
51
4,589
We built this to earn trust from Ramp customers, who rely on us for their cards, expenses, and payments. If you have a background coding agent, you can build a similar scan for your customers. Full article: x.com/RampLabs/status/205967…

2
39
5,178
We partnered with @PrimeIntellect to build Fast Ask, a small RL-trained subagent that helps our Sheets agent find answers in spreadsheets. It scores 4% over Opus on exact match accuracy at Haiku latency.
26
49
741
328,497
This was a good fit for RL because spreadsheet retrieval is repeated often, latency sensitive, and has clean feedback. The model either returns the right cent amount, date, invoice ID, yes/no, or row reference, or it does not. That let us optimize the retrieval policy directly with deterministic rewards.
2
1
61
9,196
We built a synthetic RL environment with 14 finance task types, gave the model 3 tools and 15 turns, and let it learn how to navigate workbooks on its own. Information retrieval was a huge bottleneck for our spreadsheet agent, fast ask helped solve this. Full writeup: x.com/RampLabs/status/205244…

2
55
6,561
At Ramp, we've seen AI token spend skyrocket 13x among our customers since last January. We ran experiments where coding agents managed their own token budgets. They ignored them completely, so we employed a separate controller model to approve spend on their behalf.
37
7
184
36,767
Controllers consistently followed unverified advice over the coding agent’s work right in front of them. Even with a warning that the advice might be wrong, accuracy was well below a coin flip for most models. Only one condition produced accurate decisions across the board: grounding the controller with hard numbers.
3
14
3,802
AI token spend is climbing fast as companies put agents into real workflows. Don’t let agents decide how much they should spend. Track, forecast, and control AI spend by team, model, and project → ramp.com/ai-cost-monitoring
7
2,677
Introducing Latent Briefing, a way for agents to quickly share their relevant memory directly. Result: 31% fewer tokens used, same accuracy. Multi-agent systems are powerful, but can be wildly inefficient. They pass context as tokens, so costs explode and signal gets lost. We built an algorithm that allows agents to communicate KV cache to KV cache.
37
92
1,771
669,751
We ran RLM on LongBench v2 across various document lengths and difficulty levels, observing a 30% median token reduction with a consistent 3% accuracy boost. We also found that the optimal compaction level is dynamic: Longer documents benefit from lighter compaction, while harder tasks require more aggressive filtering.
1
58
15,150
Conceptually, this is a bit like taking notes. Sometimes you’re trying to build a body of knowledge over time, and the details matter because they accumulate into something larger. In those cases, you want to preserve context rather than compress it too early. With harder problems you’re often sketching ideas, exploring directions, following threads that may or may not lead anywhere. Most of what gets written down in that process isn’t meant to last. Latent briefing = saving time and money 😎 Full write up: x.com/RampLabs/status/204266…

1
4
73
12,962