Building @jazz_tools v2 in public, Ep 3
Main learning: ground your LLMs in your concrete problem, build jigs, and religiously optimize for code iteration speed.
The last couple of days were more perf work, driven by a specific target set by a commercial use case.
Optimization un/fortunately has always been extremely addictive to me: a giant puzzle with a clear reward function.
LLMs are good at it for the same reason. They especially help with the tedious parts: run the benchmark, interpret profiles, correlate them with the implementation, form a small hypothesis, run a small experiment, rebuild, remeasure, repeat.
This already worked well for the synthetic benchmarks in our monorepo, but here we were specifically interested in performance in a concrete adopter's app, in their own repo, under NDA, etc.
And what mattered here was end-to-end performance, including the jazz server and how jazz is used in their app’s frontend. For a while I was manually getting the app into the right state, profiling, then feeding the profiles to codex. I was the slowest element in the process. Annoyingly, browser automation was similarly slow.
So I set up a private meta-repo that linked to a jazz worktree and a worktree of the adopter app, and had codex build a small jig that directly imported their app code and ran only the perf-relevant part in bun, bypassing all the app setup around it.
This shortened the iteration cycle to around 10s (plus Rust build times) and meant codex could reliably iterate on the problem autonomously.
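A jig like this doesn't need to be elaborate. Below is a minimal TypeScript sketch, not our actual setup: `perfRelevantWorkload`, `runJig` and the iteration counts are hypothetical, and a real jig would import the adopter's own (likely async) code instead of the dummy loop, awaiting it inside the timing loop. Since bun runs `.ts` files directly, something like `bun jig.ts` is the whole iteration step.

```typescript
// Hypothetical stand-in for the adopter code the jig would import directly,
// e.g. the function that loads the data behind one slow screen.
function perfRelevantWorkload(): number {
  let acc = 0;
  for (let i = 0; i < 100_000; i++) acc += Math.sqrt(i);
  return acc;
}

interface JigResult {
  samplesMs: number[];
  medianMs: number;
  minMs: number;
}

// Runs `fn` a few times untimed to warm up the JIT, then times `iterations` runs
// and reports the median and minimum wall-clock time in milliseconds.
function runJig(fn: () => unknown, iterations = 20, warmup = 3): JigResult {
  if (iterations < 1) throw new Error("iterations must be >= 1");
  for (let i = 0; i < warmup; i++) fn();
  const samplesMs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    samplesMs.push(performance.now() - start);
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const medianMs =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { samplesMs, medianMs, minMs: sorted[0] };
}

const result = runJig(perfRelevantWorkload);
console.log(`median ${result.medianMs.toFixed(2)} ms, min ${result.minMs.toFixed(2)} ms`);
```

Printing one or two stable numbers (median and min rather than a single run) is what makes the loop easy for an agent to read and compare across rebuilds.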
In one afternoon with this setup we found more optimization opportunities than I had found in the entire week before, speeding up one particular load pattern by about 24x.
I’m now using the same jig to optimize cache re-use of subqueries across similar queries, especially for row-level-security policy evaluation, which lets me try different designs super quickly. And again, both performance and correctness are grounded in adopter apps, which are more demanding and intricate than any synthetic benchmark we could have come up with this early.
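One simple form of that re-use is to memoize subquery results under a normalized key, so that a row-level-security check evaluated for many rows (or many similar queries) runs each distinct subquery only once. This is a hypothetical TypeScript sketch, not jazz's actual design: `SubqueryCache`, `subqueryKey`, the `member(user, team)` policy and the membership data are made up for illustration, and invalidation is the crudest possible (clear everything).

```typescript
type SubqueryKey = string;

// Hypothetical key: normalized subquery text plus its bound parameters.
function subqueryKey(normalizedQuery: string, params: readonly unknown[]): SubqueryKey {
  return `${normalizedQuery}|${JSON.stringify(params)}`;
}

// Memoizes subquery results so identical subqueries are evaluated only once.
class SubqueryCache<T> {
  private entries = new Map<SubqueryKey, T>();
  hits = 0;
  misses = 0;

  getOrCompute(key: SubqueryKey, compute: () => T): T {
    if (this.entries.has(key)) {
      this.hits++;
      return this.entries.get(key) as T;
    }
    this.misses++;
    const value = compute();
    this.entries.set(key, value);
    return value;
  }

  // Crudest possible invalidation: drop everything when underlying data changes.
  invalidateAll(): void {
    this.entries.clear();
  }
}

// Made-up data: which users are members of which team.
const memberships = new Map<string, Set<string>>([
  ["team-a", new Set(["alice"])],
  ["team-b", new Set(["bob"])],
]);
let membershipLookups = 0;

function isMember(user: string, team: string): boolean {
  membershipLookups++; // count how often the "expensive" subquery really runs
  return memberships.get(team)?.has(user) ?? false;
}

const cache = new SubqueryCache<boolean>();

// Hypothetical RLS policy: a user may read a row if they are a member of its team.
function canRead(user: string, row: { teamId: string }): boolean {
  const key = subqueryKey("member(user, team)", [user, row.teamId]);
  return cache.getOrCompute(key, () => isMember(user, row.teamId));
}

const rows = [{ teamId: "team-a" }, { teamId: "team-a" }, { teamId: "team-b" }, { teamId: "team-a" }];
const visible = rows.filter((row) => canRead("alice", row));
console.log(visible.length, membershipLookups, cache.hits, cache.misses); // prints "3 2 2 2"
```

Here four row checks trigger only two membership lookups. The hard parts in practice are choosing keys that match across similar (not just identical) queries and invalidating precisely when data changes, which is exactly the design space a fast jig makes cheap to explore.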
The longer-term goal is to extend this to as many adopter apps as possible and to further automate measuring how changes to jazz affect real apps.
Our North Star here is an equivalent of the Rust compiler's “crater run”, which builds and tests a large share of (all?) public Rust crates as a comprehensive check for regressions across the broad ecosystem. For us, that check should cover both correctness and performance.
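At its core, such a crater-style run compares a baseline and a candidate jazz build across many apps and classifies each app's outcome. A hypothetical TypeScript sketch: `AppRun`, `compareRuns`, the verdict names and the 10% tolerance are assumptions for illustration, and a real setup would also need repeated measurements to tell noise from genuine regressions.

```typescript
// Hypothetical per-app result from running an adopter app's jig against one jazz build.
interface AppRun {
  app: string;
  passed: boolean; // did the app's correctness checks pass?
  medianMs: number; // median time of its perf-relevant workload
}

type Verdict = "regressed" | "still-failing" | "fixed" | "slower" | "faster" | "ok";

// Classifies each app present in both runs; timing differences within
// `tolerance` (a fraction of the baseline) count as "ok".
function compareRuns(baseline: AppRun[], candidate: AppRun[], tolerance = 0.1): Map<string, Verdict> {
  const base = new Map<string, AppRun>(baseline.map((r): [string, AppRun] => [r.app, r]));
  const verdicts = new Map<string, Verdict>();
  for (const cand of candidate) {
    const b = base.get(cand.app);
    if (!b) continue; // app missing from the baseline run: nothing to compare
    if (!cand.passed) verdicts.set(cand.app, b.passed ? "regressed" : "still-failing");
    else if (!b.passed) verdicts.set(cand.app, "fixed");
    else if (cand.medianMs > b.medianMs * (1 + tolerance)) verdicts.set(cand.app, "slower");
    else if (cand.medianMs < b.medianMs * (1 - tolerance)) verdicts.set(cand.app, "faster");
    else verdicts.set(cand.app, "ok");
  }
  return verdicts;
}

// Made-up numbers for four hypothetical adopter apps.
const baselineRuns: AppRun[] = [
  { app: "app-a", passed: true, medianMs: 100 },
  { app: "app-b", passed: true, medianMs: 50 },
  { app: "app-c", passed: false, medianMs: 80 },
  { app: "app-d", passed: true, medianMs: 100 },
];
const candidateRuns: AppRun[] = [
  { app: "app-a", passed: true, medianMs: 40 },
  { app: "app-b", passed: false, medianMs: 50 },
  { app: "app-c", passed: true, medianMs: 80 },
  { app: "app-d", passed: true, medianMs: 105 },
];
const verdicts = compareRuns(baselineRuns, candidateRuns);
console.log(Object.fromEntries(verdicts));
```

Keeping correctness verdicts ahead of timing ones mirrors the crater idea: a faster build that breaks an app is still a regression.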