Recently, while trying to increase the rollout throughput of my RL fine-tuning pipeline, I noticed that the GPU sat idle for long periods instead of actually serving the LLM.
After profiling, I realised there are several cold-start costs when you serve a model with vLLM (an inference engine).
The two largest contributors to cold start are vLLM inspect and torch.compile.
vLLM inspect - vLLM supports a lot of different architectures and configurations - dense, MoE, multi-modal, speculative decoding, BF16, FP8, etc.
In order to create a reasonable execution plan, it has to inspect all the layers of the model it is running - layer shapes, attention heads, rotary embeddings, hidden dimensions, KV structure.
It also precomputes KV cache sizes, the block allocation strategy, paged attention metadata, and batching scheduler limits.
For large models like gpt-oss-120b, this step takes a substantial amount of time.
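To give a feel for the KV cache sizing part, here is a rough back-of-the-envelope sketch. It is not vLLM's actual code: the function names, the model dimensions and the 40 GiB of free memory are all made-up assumptions, and it ignores details like tensor parallelism and quantised KV caches.

```python
# Simplified KV cache sizing arithmetic (illustrative only, not vLLM internals).

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed for one token across all layers.

    The factor 2 accounts for storing both a key and a value vector
    per KV head per layer. dtype_bytes=2 corresponds to BF16/FP16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def num_kv_blocks(free_bytes, block_size_tokens, bytes_per_token):
    """How many fixed-size KV blocks fit in the memory left after weights."""
    return free_bytes // (block_size_tokens * bytes_per_token)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, BF16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
# Hypothetical budget: 40 GiB free, blocks of 16 tokens each.
blocks = num_kv_blocks(40 * 1024**3, block_size_tokens=16, bytes_per_token=per_token)

print(f"{per_token / 1024:.0f} KiB per token, {blocks} blocks, "
      f"{blocks * 16} cacheable tokens")
```

The arithmetic itself is cheap. The expensive part at startup is gathering the inputs to it, since the engine has to load and inspect the model to know the layer count, head layout and how much memory is actually left.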
Next is torch.compile - PyTorch compiles model graphs into optimized kernels. Most of the time it is pretty hard to beat these kernels on performance (although if you are good at GPU programming, you can).
But generating these optimised kernels takes substantial time, as torch has to observe tensor shapes, control flow and operator patterns to produce stable graphs. The compiler then uses these graphs to fuse matmuls, layernorm, activations and attention ops into fewer kernels. This is obviously expensive.
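A simple way to see this kind of cost is to time the first call of a callable separately from the calls that follow. The sketch below is plain Python so it runs anywhere: `time_cold_vs_warm` and `LazyCompiled` are hypothetical helpers I made up for illustration, and `LazyCompiled` only simulates a one-time compile with a sleep.

```python
import time


def time_cold_vs_warm(fn, *args, warm_runs=3):
    """Return (first_call_seconds, mean_later_call_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    cold = time.perf_counter() - start

    warm_total = 0.0
    for _ in range(warm_runs):
        start = time.perf_counter()
        fn(*args)
        warm_total += time.perf_counter() - start
    return cold, warm_total / warm_runs


class LazyCompiled:
    """Stand-in for a lazily compiled function (an assumption, not torch).

    It pays a one-time setup cost on the first call, then runs fast.
    """

    def __init__(self, fn, setup_seconds=0.05):
        self.fn = fn
        self.setup_seconds = setup_seconds
        self.ready = False

    def __call__(self, *args):
        if not self.ready:
            time.sleep(self.setup_seconds)  # simulated compile time
            self.ready = True
        return self.fn(*args)


cold, warm = time_cold_vs_warm(LazyCompiled(sum), [1, 2, 3])
print(f"cold: {cold * 1e3:.1f} ms, warm: {warm * 1e3:.3f} ms")
```

With PyTorch installed, the same helper can be pointed at `torch.compile(model)`, since compilation happens lazily on the first call. For GPU work you would also want to call `torch.cuda.synchronize()` before reading the clock, otherwise you mostly measure kernel launch time.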
My next goal is to reduce this time as much as possible, perhaps with techniques like CUDA checkpointing and snapshots.
Will update with progress.