Our GPU stack for both NVIDIA and AMD, aside from minimal pieces of signed firmware, is 100% open source and pure Python except for the compiler. It's not using vendor drivers, frameworks, or libraries. That's why it's so easy to make it work on Mac.
For compilers, we use upstream LLVM on AMD and the NAK compiler from the Mesa project on NVIDIA. We plan to replace both compilers with pure tinygrad in a year or two as well.
With RANGEIFY merged, our lowering stuff now matches the state of the art, TVM style. We're studying ThunderKittens and TileLang for speed at that level, and should have all this stuff ready in 200 days for the due date of our AMD Llama 405B training contract.
Due to tinygrad's small size and pure Python nature, it's the easiest ML library to make progress on, aka the steepest slope of improvement. With Megakernel style for scheduling, MODeL_opt style for planning, and E-graph style for symbolic, we should blow past the state of the art in PyTorch and JAX speed.
If we do that, NVIDIA's moat is over. It's 1000 lines at most to add a new accelerator to tinygrad. And I don't mean to add a new accelerator with help from a kernel driver, compiler, and libraries. Just 1000 lines of software for the *whole* accelerator speaking right on the PCIe BARs, like what tinygrad is doing with the NVIDIA and AMD GPUs now.
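To make "speaking right on the PCIe BARs" concrete, here is a minimal sketch of what userspace register access can look like from Python. It is not tinygrad's code. It assumes Linux, where each BAR of a PCI device shows up as a `resourceN` file under `/sys/bus/pci/devices/<bdf>/` that can be memory-mapped. The helper names `map_bar`, `read32`, and `write32` and the register offset `0x10` are hypothetical, and a temp file stands in for the BAR so the sketch runs without hardware or root.

```python
import mmap
import os
import struct
import tempfile

def map_bar(path, size):
    """Memory-map `size` bytes of a file. On Linux, `path` could be a PCI
    resource file such as /sys/bus/pci/devices/<bdf>/resource0 (needs root)."""
    with open(path, "r+b") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)

def read32(bar, offset):
    """Read a little-endian 32-bit value at `offset` in the mapping."""
    return struct.unpack_from("<I", bar, offset)[0]

def write32(bar, offset, value):
    """Write a little-endian 32-bit value at `offset` in the mapping."""
    struct.pack_into("<I", bar, offset, value & 0xFFFFFFFF)

# Stand-in for a real BAR: a zero-filled 4 KiB temp file (assumption, for demo only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096)
    fake_bar_path = tmp.name

bar = map_bar(fake_bar_path, 4096)
write32(bar, 0x10, 0xDEADBEEF)   # 0x10 is a made-up register offset
value = read32(bar, 0x10)
bar.close()
os.unlink(fake_bar_path)
```

On real hardware the same pattern would point at the device's resource file instead of a temp file, but real MMIO also needs care this sketch skips: access width, ordering, and which registers are safe to touch all depend on the device.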