took a quick look at this paper (just the convolution section) and I have several concerns about the claims:
1) PyTorch by default runs GPU work asynchronously with respect to the host (host vs. device), and anyone who has forgotten to sync when benchmarking can tell you so
2) TF32 is already enabled by default for cuDNN convolutions in PyTorch (torch.backends.cudnn.allow_tf32 defaults to True), so enabling it is not an optimization
3) the above is also an example of tuning framework parameters rather than optimizing the kernels themselves, which I’m not sure is in the spirit of KernelBench. You’ll see that many of the “custom_code” fields in the repo contain just modified PyTorch code with no CUDA kernels at all!
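For points 1 and 2, here is a minimal sketch of a synced measurement. `time_fn` and its `sync` parameter are hypothetical names made up for illustration, not anything from the paper or from KernelBench. The GPU part only runs if PyTorch and a CUDA device are available, and the layer and tensor shapes are arbitrary.

```python
import time


def time_fn(fn, sync=lambda: None, warmup=3, iters=10):
    """Mean wall-clock seconds per call of fn.

    sync() is called right before the clock starts and right before it
    stops, so queued asynchronous work is counted in the measurement.
    """
    for _ in range(warmup):
        fn()
    sync()  # drain warmup work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for queued work to finish before stopping the clock
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # Point 2: print the flags rather than assuming their values.
        print("cudnn.allow_tf32:", torch.backends.cudnn.allow_tf32)
        print("cuda.matmul.allow_tf32:", torch.backends.cuda.matmul.allow_tf32)

        # Arbitrary illustrative layer and input (assumed shapes).
        conv = torch.nn.Conv2d(64, 64, 3, padding=1).cuda()
        x = torch.randn(16, 64, 128, 128, device="cuda")
        with torch.no_grad():
            # Without a sync, this mostly times kernel launches.
            no_sync = time_fn(lambda: conv(x))
            synced = time_fn(lambda: conv(x), sync=torch.cuda.synchronize)
        print(f"no sync: {no_sync * 1e3:.3f} ms, synced: {synced * 1e3:.3f} ms")
```

If the unsynced number comes out much smaller than the synced one, that gap is measurement error and not a speedup.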
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Trains a DeepSeek-V3-671B model to optimize CUDA kernels using only execution-time speedup as the reward.
Pipeline:
- SFT: Fine-tuned on 2.1K correct, executable CUDA variants from 6 LLMs across 250 KernelBench tasks
- Self-Supervised: Iterative REINFORCE updates on successful code (execution correctness)
- Contrastive-RL: Prompts include prior variants with their speedup scores; model compares, improves, and updates via GRPO
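To make the "speedup as reward, update via GRPO" step concrete, here is a rough sketch of the group-relative advantage that GRPO is generally described as using: each candidate's reward is normalized by the mean and standard deviation of its sampling group. The function names and timings are hypothetical, the population standard deviation is an assumption, and the paper's actual reward shaping is not reproduced here.

```python
import statistics


def speedup(ref_time, new_time):
    """Speedup of a candidate over the reference (above 1.0 means faster)."""
    return ref_time / new_time


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption here
    if std == 0:
        # Every candidate scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Hypothetical timings in milliseconds for one task (not from the paper).
ref_ms = 10.0
candidate_ms = [10.0, 5.0, 2.5, 20.0]

rewards = [speedup(ref_ms, t) for t in candidate_ms]
advantages = group_relative_advantages(rewards)

for t, r, a in zip(candidate_ms, rewards, advantages):
    print(f"time={t:5.1f} ms  speedup={r:4.2f}x  advantage={a:+.3f}")
```

Since the reward is just a measured time ratio, anything that skews the timing, such as a missing sync, feeds directly into the advantage.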
Results (KernelBench, 250 tasks):
- 17.7× avg speedup (A100), 449× max
- 100% success on complex ML workloads (Level 3), 50.8× mean
- Generalizes across GPUs: 19.0× (3090), 17.8× (H100), 13.9× (H20)
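When reading an average next to a max that is far larger, it helps to remember how much the choice of summary statistic matters. The sketch below uses made-up per-task speedups, not numbers from the paper, and makes no claim about which statistic the authors report.

```python
import statistics

# Hypothetical per-task speedups (illustrative only, not from the paper).
speedups = [1.0, 1.2, 2.0, 4.0, 100.0]

arith_mean = statistics.fmean(speedups)
median = statistics.median(speedups)
geo_mean = statistics.geometric_mean(speedups)

print(f"arithmetic mean: {arith_mean:.2f}x")
print(f"median:          {median:.2f}x")
print(f"geometric mean:  {geo_mean:.2f}x")
```

A single large outlier dominates the arithmetic mean, while the median and geometric mean stay close to the typical task.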
Findings:
- Discovers non-trivial strategies that multiply performance
- Identifies gatekeeper techniques (e.g., stream management → unlocks CUDA graphs)
- Learns optimization dependencies (e.g., certain techniques must precede others)
- Outperforms both GRPO-only and evolutionary LLMs
- No reward hacking (via prompt constraints)