20 | Building a raw CUDA/PTX DeepSeek V4 inference kernel on B200 | Learning by building :)

Joined October 2025
Btw, this video lecture on H100 by @_PrateekShukla_ is really amazing (link below) - it walks you through PTX, CUTLASS, and core GPU logic fundamentals in detail. Highly recommended even if you’re working on Blackwell or any other architecture. \(^o^)/ youtu.be/SqQUQHdYWyc?si=5jkN…

One really awesome thing about open-weight models: they’re so cheap that I have zero mental block about burning tokens, and I don’t stress about making my prompts super efficient. This is actually pushing me to experiment more and integrate AI deeper into my workflow. Pretty fun ngl :D
Working on a GEMM kernel before proceeding further into DeepSeek internals.
Today I walked through Sliding Window Attention (SWA) by asking DeepSeek to extract only the SWA part from my toy decoder. Shared-KV MQA is really interesting. K and V are literally the same vector, so you store half the KV cache and skip one projection. The memory performance gains are huge, but I’m wondering how negligible the drop in context quality actually is. Also learned about attention sink - basically a learned “trash bucket” in the softmax that soaks up weight that would otherwise go to irrelevant tokens, so they don’t steal attention. Ended the day looking at some kernel design primitives.
 __
( o>
///\
\V_/_
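To make the SWA window and the shared-KV cache saving concrete, here is a rough sketch in C. The function names, the window size, the head dimension and the bytes-per-element value are all made up for illustration; none of them are DeepSeek V4's real parameters, and this only models the bookkeeping, not the attention math itself.

```c
#include <assert.h>
#include <stddef.h>

/* First key position that a query at position q may attend to, for a causal
 * sliding window of `window` tokens (the query's own position counts as one). */
size_t swa_first_key(size_t q, size_t window) {
    assert(window > 0);
    return (q + 1 > window) ? q + 1 - window : 0;
}

/* Number of key positions visible to the query at position q. */
size_t swa_num_keys(size_t q, size_t window) {
    return q - swa_first_key(q, window) + 1;
}

/* Bytes of cache needed for `tokens` cached positions with one KV head.
 * shared_kv != 0 models "K and V are the same vector": one stored vector
 * per token instead of two. All sizes here are hypothetical inputs. */
size_t kv_cache_bytes(size_t tokens, size_t head_dim,
                      size_t bytes_per_elem, int shared_kv) {
    size_t vectors_per_token = shared_kv ? 1 : 2;
    return tokens * vectors_per_token * head_dim * bytes_per_elem;
}
```

With a window of 4, a query at position 10 only sees keys 7 through 10, so the cache never has to hold more than `window` positions per layer, and sharing K and V halves whatever is left.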
Spent the last 2 days understanding the DeepSeek V4 paper. Midway through, I realised I would be better off understanding the forward pass through a toy representation made with GPT, and things started to click far more than from just reading the paper. In hindsight I should've leveraged GPT more and done this in the first place lol. Will start working on a simple kernel, now that it takes much less time :)
Learnt some CUDA/PTX for B200 and thought why not start building right now? So I’m building a full DeepSeek V4 inference kernel using only raw CUDA/PTX. Just-in-time learning with LLMs the whole way. Daily progress and learnings coming soon. Should be fun :)
22 Nov 2025
Challenge: build my own simple malloc to understand C memory. New to C, learning via Dan Luu’s tutorial: danluu.com/malloc-tutorial/ Surprise: stack & heap share the same virtual address space (see thread)! Any tips while building? #CProgramming #LowLevelDev .----. | o_o |

23 Nov 2025
Every allocation = hidden metadata + user data. We return the address sizeof(meta) bytes past the metadata, i.e. (metadata + 1), to the user. Metadata holds: block size, free flag, and pointer to next block → a linked list of all chunks. Now we can walk the list, reuse free blocks, and actually implement free().
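A minimal sketch of that header layout in C. The struct name `block_meta` and the two helper names are hypothetical (the post only says "meta"); the point is just the pointer arithmetic between the hidden header and the pointer the user sees.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header stored immediately before each user block. */
typedef struct block_meta {
    size_t size;             /* usable bytes that follow the header */
    int free;                /* 1 if the block can be reused */
    struct block_meta *next; /* next block in the list of all chunks */
} block_meta;

/* Header -> user pointer. Arithmetic on a block_meta* moves in units of
 * sizeof(block_meta), so "+ 1" skips exactly one header. */
void *meta_to_user(block_meta *m) {
    assert(m != NULL);
    return (void *)(m + 1);
}

/* User pointer -> header: step back one header from what malloc returned. */
block_meta *user_to_meta(void *p) {
    assert(p != NULL);
    return (block_meta *)p - 1;
}
```

Casting to `block_meta *` before doing the arithmetic avoids the classic mistake of adding `sizeof(meta)` to a typed pointer, which would skip `sizeof(meta)` headers instead of `sizeof(meta)` bytes.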
27 Nov 2025
Now we can implement free()! To free(p): metadata = (meta *)p - 1; metadata->free = 1; The block is now reusable. To reuse freed space, malloc first traverses the list looking for a free block that is big enough before asking for new memory.
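Here is a toy version of that free-and-reuse flow, as a sketch only. Assumptions: the names `toy_malloc`, `toy_free` and `block_meta` are made up, a fixed static arena stands in for the `sbrk()` calls used in the tutorial, reuse is plain first fit, and there is no block splitting or merging.

```c
#include <assert.h>
#include <stddef.h>

typedef struct block_meta {
    size_t size;             /* usable bytes that follow the header */
    int free;                /* 1 if the block can be reused */
    struct block_meta *next; /* next block in the list */
} block_meta;

/* Fixed arena measured in header-sized units (assumption: replaces sbrk). */
#define ARENA_UNITS 256
static block_meta arena[ARENA_UNITS];
static size_t arena_used = 0; /* units handed out so far */
static block_meta *head = NULL, *tail = NULL;

void *toy_malloc(size_t size) {
    if (size == 0 || size > sizeof(arena)) return NULL;
    /* First fit: walk the list and reuse a free block that is big enough. */
    for (block_meta *b = head; b != NULL; b = b->next) {
        if (b->free && b->size >= size) {
            b->free = 0;
            return b + 1;
        }
    }
    /* No reusable block: carve one header unit plus the payload, with the
     * payload rounded up to whole units so the next header stays aligned. */
    size_t units = (size + sizeof(block_meta) - 1) / sizeof(block_meta);
    if (units + 1 > ARENA_UNITS - arena_used) return NULL;
    block_meta *b = &arena[arena_used];
    arena_used += units + 1;
    b->size = units * sizeof(block_meta);
    b->free = 0;
    b->next = NULL;
    if (tail) tail->next = b; else head = b;
    tail = b;
    return b + 1;
}

void toy_free(void *p) {
    if (p == NULL) return;
    block_meta *b = (block_meta *)p - 1; /* step back to the hidden header */
    assert(!b->free);                    /* catch an obvious double free */
    b->free = 1;
}
```

Freeing a 40-byte block and then asking for 24 bytes hands back the same address, which is the reuse described above. The cost of skipping block splitting is that the extra bytes in the reused block stay wasted.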

14 Nov 2025
printf("Hello World!");
   .----.
  | o_o |
  | :_/ |
 / /   \ \
( |     | )
/'\_   _/`\
\___)=(___/