devlog: I have ONNX models running in my ML framework again, but now with graph capture, not just eager execution. I'm roughly modeling it after PyTorch. Similar to torch.fx, I do symbolic tracing with fake tensors (names instead of ids, no memory allocations, ...) to record ops.
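A minimal sketch of what recording ops with fake tensors can look like, not the framework's actual code. Every name here (`FakeTensor`, `Tracer`, `Node`, the `op_index` naming scheme) is hypothetical, and shape inference covers only two ops with no broadcasting:

```rust
// Hypothetical names throughout; this is not the framework's real API.

/// A fake tensor: a name plus shape metadata, with no backing memory.
#[derive(Clone, Debug, PartialEq)]
pub struct FakeTensor {
    pub name: String,
    pub shape: Vec<usize>,
}

impl FakeTensor {
    pub fn new(name: &str, shape: &[usize]) -> Self {
        FakeTensor { name: name.to_string(), shape: shape.to_vec() }
    }
}

/// One recorded op; inputs and output are referenced by tensor name.
#[derive(Debug, PartialEq)]
pub struct Node {
    pub op: &'static str,
    pub inputs: Vec<String>,
    pub output: String,
}

/// Records ops in call order while tracing.
#[derive(Default)]
pub struct Tracer {
    pub nodes: Vec<Node>,
}

impl Tracer {
    /// Append a node and return a fake tensor naming its output.
    fn record(&mut self, op: &'static str, inputs: &[&FakeTensor], shape: Vec<usize>) -> FakeTensor {
        // Assumed naming scheme: op name plus the node's index in the graph.
        let output = format!("{}_{}", op, self.nodes.len());
        self.nodes.push(Node {
            op,
            inputs: inputs.iter().map(|t| t.name.clone()).collect(),
            output: output.clone(),
        });
        FakeTensor { name: output, shape }
    }

    /// Elementwise add: shapes must match (no broadcasting in this sketch).
    pub fn add(&mut self, a: &FakeTensor, b: &FakeTensor) -> FakeTensor {
        assert_eq!(a.shape, b.shape, "add: shape mismatch");
        self.record("add", &[a, b], a.shape.clone())
    }

    /// 2-D matmul: [m, k] x [k, n] -> [m, n].
    pub fn matmul(&mut self, a: &FakeTensor, b: &FakeTensor) -> FakeTensor {
        assert_eq!(a.shape.len(), 2, "matmul: lhs must be 2-D");
        assert_eq!(b.shape.len(), 2, "matmul: rhs must be 2-D");
        assert_eq!(a.shape[1], b.shape[0], "matmul: inner dims differ");
        self.record("matmul", &[a, b], vec![a.shape[0], b.shape[1]])
    }
}
```

Tracing `add(matmul(x, w), b)` through this leaves two nodes in `tracer.nodes`, with the second one referring to the first only by its output name. Shapes are propagated without allocating any data.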
Tensor ops execute eagerly, and are traced if run within `Graph::capture()` (I used to hide this with some behind-the-scenes wiring and thread locals to store state). During capture, the ops record nodes into a graph. Tensors are aliased so recorded nodes refer to named tensors. When graphs run, they create "guards" that store metadata later used to check inputs and outputs (names, data types, shapes, etc., kinda like in TorchDynamo). Optimization currently just involves pruning dead nodes, some validation, and a topological sort.
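Roughly, the pruning, topological sort, and guard check could be sketched like this. It is an illustration under assumptions, not the actual implementation: `Node`, `Guard`, and `prune_and_sort` are hypothetical names, dtypes are plain strings, and guards compare metadata exactly.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical recorded op; tensors are referenced by name.
#[derive(Clone, Debug, PartialEq)]
pub struct Node {
    pub op: &'static str,
    pub inputs: Vec<String>,
    pub output: String,
}

impl Node {
    pub fn new(op: &'static str, inputs: &[&str], output: &str) -> Self {
        Node {
            op,
            inputs: inputs.iter().map(|s| s.to_string()).collect(),
            output: output.to_string(),
        }
    }
}

/// Hypothetical guard: metadata recorded once, checked on later runs.
#[derive(Clone, Debug, PartialEq)]
pub struct Guard {
    pub name: String,
    pub dtype: &'static str,
    pub shape: Vec<usize>,
}

impl Guard {
    pub fn check(&self, name: &str, dtype: &str, shape: &[usize]) -> Result<(), String> {
        if self.name.as_str() != name || self.dtype != dtype || self.shape.as_slice() != shape {
            return Err(format!("guard failed for `{}`: expected {:?}", name, self));
        }
        Ok(())
    }
}

// Depth-first walk from a tensor name back to its producers.
fn visit<'a>(
    name: &'a str,
    producer: &HashMap<&'a str, &'a Node>,
    done: &mut HashSet<&'a str>,
    visiting: &mut HashSet<&'a str>,
    order: &mut Vec<Node>,
) -> Result<(), String> {
    // Names with no producer are graph inputs or constants.
    let node: &'a Node = match producer.get(name) {
        Some(n) => *n,
        None => return Ok(()),
    };
    if done.contains(name) {
        return Ok(());
    }
    // Seeing a name that is still on the DFS stack means a cycle.
    if !visiting.insert(name) {
        return Err(format!("cycle through tensor `{}`", name));
    }
    for input in &node.inputs {
        visit(input.as_str(), producer, done, visiting, order)?;
    }
    visiting.remove(name);
    done.insert(name);
    order.push(node.clone()); // post-order: producers land before consumers
    Ok(())
}

/// Keep only nodes reachable from `outputs`, ordered so that every node
/// comes after the nodes producing its inputs.
pub fn prune_and_sort(nodes: &[Node], outputs: &[&str]) -> Result<Vec<Node>, String> {
    let producer: HashMap<&str, &Node> =
        nodes.iter().map(|n| (n.output.as_str(), n)).collect();
    let mut done = HashSet::new();
    let mut visiting = HashSet::new();
    let mut order = Vec::new();
    for out in outputs {
        visit(out, &producer, &mut done, &mut visiting, &mut order)?;
    }
    Ok(order)
}
```

Walking backwards from the outputs does both jobs in one pass: anything the walk never reaches is dead and gets dropped, and the post-order push yields a valid execution order.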
I have a custom CUDA allocator that can reuse blocks, which is useful if you know what and when to allocate ahead of time.
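To illustrate the block-reuse idea, here is a host-side sketch that only simulates device memory with offsets and makes no CUDA calls. `CachingAllocator`, `Block`, the 256-byte rounding, and the best-fit policy are all assumptions for the example, not a description of the actual allocator:

```rust
use std::collections::BTreeMap;

/// A region of (simulated) device memory.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Block {
    pub offset: usize,
    pub size: usize,
}

/// Hypothetical caching allocator: freed blocks are kept and handed out again.
#[derive(Default)]
pub struct CachingAllocator {
    cache: BTreeMap<usize, Vec<Block>>, // block size -> freed blocks of that size
    next_offset: usize,                 // stand-in for fresh device memory
    pub fresh_allocs: usize,            // how often the "driver" had to be asked
}

impl CachingAllocator {
    pub fn alloc(&mut self, size: usize) -> Block {
        // Round up so similar requests share a size class (256 is an assumed granularity).
        let size = (size.max(1) + 255) / 256 * 256;
        // Best fit: the smallest cached size class that is large enough.
        let hit = self.cache.range(size..).next().map(|(k, _)| *k);
        if let Some(k) = hit {
            let list = self.cache.get_mut(&k).unwrap();
            let block = list.pop().unwrap(); // lists in the map are never empty
            if list.is_empty() {
                self.cache.remove(&k);
            }
            return block;
        }
        // Cache miss: a real allocator would call into the CUDA driver here.
        let block = Block { offset: self.next_offset, size };
        self.next_offset += size;
        self.fresh_allocs += 1;
        block
    }

    /// Return a block to the cache instead of releasing it.
    pub fn free(&mut self, block: Block) {
        self.cache.entry(block.size).or_default().push(block);
    }
}
```

With a captured graph the tensor sizes and lifetimes are known before execution, so in principle the whole alloc/free sequence can be planned up front. This sketch never splits oversized blocks or merges neighbours, which a real allocator would likely want to do.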
After that, it gets messy. There's a "compile" step that chooses the "providers" (cuDNN, cuTENSOR, or custom kernels compiled with NVRTC) that graphs are executed with. These libraries require some annoying maintenance and aren't that straightforward to use. For the "TorchInductor" part, I only have ideas, nothing that can run end-to-end yet.