Thanks for sharing the data!
The cost gap is mostly output tokens. JSON is structurally heavier than markdown, and output costs 8x as much as input on GPT-5.2. The variance you're seeing comes from the LLM making different design choices on each run.
Things that'll help: trim your catalog down to just what you need (fewer components = simpler output), build higher-level components so the LLM doesn't have to assemble 10 primitives when one composite would do, and use customRules to cap element count. Also make sure prompt caching is kicking in: the system prompt is the same on every call, and cached input is 90% cheaper.
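To make the arithmetic concrete, here's a rough sketch of how the 8x output multiplier and the 90% cache discount combine. The base price and all token counts are made-up placeholders (not published rates or your actual numbers), and `estimateCostUSD`, `Pricing`, and `Usage` are hypothetical names used only for illustration:

```typescript
// All prices and token counts are illustrative placeholders, not real rates.
interface Pricing {
  inputPerMTok: number;        // USD per 1M uncached input tokens (placeholder)
  cachedInputDiscount: number; // 0.9 means cached input is 90% cheaper
  outputMultiplier: number;    // output price as a multiple of the input price
}

interface Usage {
  inputTokens: number;       // total input tokens for the call
  cachedInputTokens: number; // portion of inputTokens served from the cache
  outputTokens: number;
}

function estimateCostUSD(p: Pricing, u: Usage): number {
  const uncachedTokens = u.inputTokens - u.cachedInputTokens;
  const inputCost = uncachedTokens * p.inputPerMTok;
  const cachedCost =
    u.cachedInputTokens * p.inputPerMTok * (1 - p.cachedInputDiscount);
  const outputCost = u.outputTokens * p.inputPerMTok * p.outputMultiplier;
  return (inputCost + cachedCost + outputCost) / 1_000_000;
}

const pricing: Pricing = {
  inputPerMTok: 1,
  cachedInputDiscount: 0.9,
  outputMultiplier: 8,
};

// Same prompt size, but the JSON answer is assumed to be 3x longer than markdown.
const markdownCall = estimateCostUSD(pricing, {
  inputTokens: 4000,
  cachedInputTokens: 0,
  outputTokens: 500,
});
const jsonCall = estimateCostUSD(pricing, {
  inputTokens: 4000,
  cachedInputTokens: 0,
  outputTokens: 1500,
});
// Same JSON call, with most of the system prompt served from the cache.
const jsonCallCached = estimateCostUSD(pricing, {
  inputTokens: 4000,
  cachedInputTokens: 3500,
  outputTokens: 1500,
});

console.log({ markdownCall, jsonCall, jsonCallCached });
```

With these placeholder numbers, caching helps but output still dominates the bill, which is why trimming the catalog and capping element count tend to matter more than input-side savings.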
Another pattern worth trying: keep GPT-5.2 as your reasoning model and have it tool-call a smaller, cheaper model (mini/nano) to actually generate the JSON. 5.2 decides what to build, and the small model outputs it.
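Here's a minimal sketch of that split. It's simplified to two sequential calls rather than real tool-calling, and the model callers are injected stubs, not any SDK's actual API. `buildUI`, `ModelCall`, and the `{ type, props }` shape are hypothetical and only for illustration; they are not json-render's real schema:

```typescript
// Hypothetical sketch: model callers are injected, so no real SDK is assumed.
type ModelCall = (prompt: string) => Promise<string>;

// Stage 1: the larger model decides what to build and returns a short plan.
// Stage 2: the smaller model turns that plan into the (more expensive) JSON output.
async function buildUI(
  userRequest: string,
  reasoningModel: ModelCall, // e.g. GPT-5.2
  generatorModel: ModelCall, // e.g. a mini/nano model
): Promise<unknown> {
  const plan = await reasoningModel(
    `Describe, in a few short bullet points, the UI needed for: ${userRequest}`,
  );
  const raw = await generatorModel(
    `Emit only JSON for this UI plan, using the component catalog:\n${plan}`,
  );
  return JSON.parse(raw); // throws if the small model returns invalid JSON
}

// Stub models so the sketch runs without network access.
const fakeReasoning: ModelCall = async () => "- one Card titled Revenue";
const fakeGenerator: ModelCall = async () =>
  JSON.stringify({ type: "Card", props: { title: "Revenue" } });

buildUI("show revenue", fakeReasoning, fakeGenerator).then((spec) => {
  console.log(spec);
});
```

The idea is that the expensive model only emits a short plan, so the bulk of the output tokens are billed at the smaller model's rate.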
More generally, json-render works with any model. With a curated catalog of high-level components, a cheaper model could land under your markdown-with-GPT-5.2 costs.
We'll keep refining this to improve both cost efficiency and consistency.