Behind the scenes of mni-ml:
January 4th 2026 - my roommate @MankyDankyBanky and I wanted to do a big project together.
“maybe we should try to build pytorch from scratch”
We found @srush_nlp's minitorch curriculum and committed to grinding through it Jan to April.
February - autodiff and tensor internals done. lots of late night PR reviews, stacked diffs, Kinton ramen runs to Toronto when I'd visit Aadi at Shopify. We started posting on X to keep ourselves accountable.
March - the month of parallelization: Aadi shipped tiled matmul using the same algo @nvidia teaches in their CUDA guide. wrapped by end of month: pooling, conv1d/2d forward + backward, softmax, dropout.
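For anyone curious what "tiled" means here, a minimal sketch in plain Python, not the actual mni-ml kernel. The function name `tiled_matmul` and the `tile` parameter are made up for illustration. The idea is to compute the output one small block at a time, so each block of the inputs gets reused while it is "hot" (on a GPU, that would be shared memory).

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), given as lists of lists, tile by tile.

    Illustrative only: a real CUDA kernel would assign one thread block per
    output tile and stage the input tiles in shared memory.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # tile row of the output
        for j0 in range(0, m, tile):          # tile column of the output
            for k0 in range(0, k, tile):      # walk tiles along the shared dim
                # Accumulate this pair of input tiles into the output tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] = acc
    return out


result = tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

The `min(...)` bounds handle matrices whose sizes are not a multiple of the tile size, which is the edge case tiled kernels usually have to guard against.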
March 22-23 - @socraticainfo symposium & we saw the tinytpu team on stage, which filled us with determination 🫡 cc: @evanliin @XanderChin @suryasure05 @kennykgguo
March 24 - chose the mni-ml brand and started the educational blog
March 30 - minitorch is DONE ahead of schedule. now we build on top of the framework.
April 5-6 - cuBLAS matmul via koffi FFI. buffer pooling, strided batched GEMM, kernel optimizations. CUDA backend takes shape.
April 7 - huge day. cross-platform CI pipeline, prebuilt npm binaries, v0.3.0 - CUDA live on @npmjs. flattened the monorepo, added @WebGPU and Windows CUDA build targets by eod.
April 12 - flash attention CUDA kernel ships. we caught a bug where head dim > 32 was truncating.
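A rough sketch of the core trick behind flash attention, in plain Python for a single query vector. This is not the mni-ml CUDA kernel and it does not reproduce the bug above; the names `attention_online` and `attention_reference` and the `block` parameter are hypothetical. Keys and values are processed in blocks while a running max and running sum keep the softmax numerically stable, so the full score row is never stored.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_online(q, keys, values, block=2):
    """Same result, computed block by block with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # sum of exp(score - running_max)
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        # Dot products loop over every element of q, whatever the head dim is.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [x * correction for x in acc]
        for s, v in zip(scores, vb):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [x + p * vj for x, vj in zip(acc, v)]
        running_max = new_max
    return [x / running_sum for x in acc]


# Head dim 40 (> 32) with 5 keys, using deterministic toy data.
head_dim = 40
q = [0.01 * i for i in range(head_dim)]
keys = [[math.sin(i + j) for j in range(head_dim)] for i in range(5)]
values = [[math.cos(i * j) for j in range(head_dim)] for i in range(5)]
online = attention_online(q, keys, values)
reference = attention_reference(q, keys, values)
```

Comparing `online` against `reference` element by element is a cheap way to check a blocked implementation against the naive one at any head dimension.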
April 14 (during exam season) - we recorded the demo in the @Shopify recording studio during Aadi’s lunch break. Everything over the last 4mo finally came together. cc: @fnthawar @tobi @alspee
April 17 - launch post, bought the domain mni.ml, and we’re just getting started. We have so much in store for this summer, stay tuned 🫡
cc: @sundeep @GavinSherry
the birth of @mni-ml/framework
I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more.
Wrote the full transformer architecture and BPE tokenizer from scratch.
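For context on what a from-scratch BPE tokenizer involves, here is a minimal byte-level sketch in Python. It is an assumption-laden illustration, not the tokenizer from the repo: `most_frequent_pair`, `merge_pair`, and `train_bpe` are hypothetical names. Training repeatedly finds the most common adjacent token pair and replaces it with a new token id.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair, new_token):
    """Replace each non-overlapping occurrence of pair with new_token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text."""
    tokens = list(text.encode("utf-8"))   # start from raw bytes (ids 0..255)
    merges = {}
    next_id = 256                          # first id after the byte range
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges[pair] = next_id
        tokens = merge_pair(tokens, pair, next_id)
        next_id += 1
    return tokens, merges


tokens, merges = train_bpe("aaabdaaabac", 1)
```

Encoding new text would replay the learned merges in the order they were created; that step is left out here to keep the sketch short.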
The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform
Try out the model for yourself:
mni-ml.github.io/demos/trans…
Built with @_reesechong. Check out the repos and blog if you want to learn more.
Shoutout to @modal for the compute credits that let me train on 2 A100 GPUs without going broke
cc @sundeep @GavinSherry