GK Yedhu

GK Yedhu

@gkdev10

May 6

Btw, this video lecture on H100 by @_PrateekShukla_ is really amazing (link below) - it walks you through PTX, CUTLASS, and core GPU logic fundamentals in detail. Highly recommended even if you’re working on Blackwell or any other architecture. \(^o^)/ youtu.be/SqQUQHdYWyc?si=5jkN…

GK Yedhu

@gkdev10

May 4

One really awesome thing about open-weight models: they’re so cheap that I have zero mental block about burning tokens and feel no pressure to make my prompts super efficient. This is actually pushing me to experiment more and integrate AI deeper into my workflow. Pretty fun ngl :D

GK Yedhu

@gkdev10

May 4

Working on a GEMM kernel before proceeding further into DeepSeek internals.

GK Yedhu

@gkdev10

May 1

Today I walked through Sliding Window Attention (SWA) by asking DeepSeek to extract only the SWA part from my toy decoder. Shared-KV MQA is really interesting. K and V are literally the same vector, so you store half the KV cache and skip one projection. The memory and performance gains are huge, but I’m wondering how negligible the drop in context quality actually is. Also learned about attention sink - basically a learned “trash bucket” in the softmax that absorbs weight that would otherwise go to irrelevant tokens, so they don’t steal attention. Ended the day looking at some kernel design primitives. __ ( o> ///\ \V_/_
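A minimal C sketch of two ideas from this post: the causal sliding-window visibility rule and the cache-size effect of sharing one vector for K and V. The function names (`swa_visible`, `swa_window_start`, `kv_cache_bytes`) and the per-head, per-layer accounting are assumptions made for illustration. This is not DeepSeek's code, and it leaves out the attention sink.

```c
#include <assert.h>
#include <stddef.h>

/* Causal sliding window: query position i may attend to key position j
   only if j is one of the `window` most recent positions up to and
   including i. */
int swa_visible(size_t i, size_t j, size_t window) {
    assert(window > 0);
    return j <= i && i - j < window;
}

/* First key position that is still inside the window for query i. */
size_t swa_window_start(size_t i, size_t window) {
    assert(window > 0);
    return i + 1 > window ? i + 1 - window : 0;
}

/* Bytes cached for one KV head in one layer (hypothetical accounting).
   With a shared K/V vector only one tensor is stored per position,
   otherwise two (K and V). */
size_t kv_cache_bytes(size_t positions, size_t head_dim,
                      size_t bytes_per_elem, int shared_kv) {
    size_t tensors = shared_kv ? 1 : 2;
    return positions * head_dim * bytes_per_elem * tensors;
}
```

With a window of 3, query position 5 sees keys 3, 4, and 5. For the same number of cached positions, the shared-KV case stores exactly half the bytes of the separate K and V case.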

GK Yedhu

@gkdev10

Apr 30

Spent the last 2 days understanding the DeepSeek V4 paper. Midway through, I realised I would be better off understanding the forward pass through a toy representation made with GPT, and things started to click far more than from just reading the paper. In hindsight, I should have leveraged GPT more and done this in the first place lol. Will start working on a simple kernel now that it takes much less time :)

GK Yedhu

@gkdev10

Apr 28

Learnt some CUDA/PTX for B200 and thought, why not start building right now? So I’m building a full DeepSeek V4 inference kernel using only raw CUDA PTX. Just-in-time learning with LLMs the whole way. Daily progress and learnings coming soon. Should be fun :)

GK Yedhu

@gkdev10

22 Nov 2025

Challenge: build my own simple malloc to understand C memory. New to C, learning via Dan Luu’s tutorial: danluu.com/malloc-tutorial/ Surprise: stack & heap share the same virtual address space (see thread)! Any tips while building? #CProgramming #LowLevelDev .----. | o_o |

GK Yedhu

@gkdev10

23 Nov 2025

Every allocation = hidden metadata + user data. We return (metadata + sizeof(meta)) to the user. Metadata holds: block size, free flag, and pointer to next block → a linked list of all chunks. Now we can walk the list, reuse free blocks, and actually implement free().
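A minimal C sketch of that layout. It assumes a fixed static arena in place of `sbrk()` so it runs anywhere, and the names `toy_malloc`, `toy_header`, and `toy_first_block` are made up for illustration. Payloads are only aligned as strictly as `struct meta` itself, which a real allocator would need to improve.

```c
#include <assert.h>
#include <stddef.h>

/* Hidden header stored directly in front of every user block. */
struct meta {
    size_t size;        /* bytes the user asked for */
    int free;           /* 1 if the block can be reused */
    struct meta *next;  /* next block in the list of all chunks */
};

/* Assumption: a static arena stands in for sbrk(). The union keeps the
   byte array aligned for struct meta. */
static union { struct meta align; unsigned char bytes[4096]; } arena;
static size_t arena_used = 0;
static struct meta *head = NULL, *tail = NULL;

/* Round n up so the header that follows the payload stays aligned. */
static size_t round_up(size_t n) {
    size_t a = _Alignof(struct meta);
    return (n + a - 1) / a * a;
}

void *toy_malloc(size_t size) {
    if (size == 0 || size > sizeof(arena.bytes)) return NULL;
    size_t need = sizeof(struct meta) + round_up(size);
    if (need > sizeof(arena.bytes) - arena_used) return NULL;

    struct meta *m = (struct meta *)(arena.bytes + arena_used);
    arena_used += need;
    m->size = size;
    m->free = 0;
    m->next = NULL;

    /* Append to the linked list of all chunks. */
    if (tail != NULL) tail->next = m; else head = m;
    tail = m;

    return m + 1; /* user pointer = header address + sizeof(struct meta) */
}

/* Step back from a user pointer to its hidden header. */
struct meta *toy_header(void *p) {
    assert(p != NULL);
    return (struct meta *)p - 1;
}

/* Start of the chunk list, for walking it. */
struct meta *toy_first_block(void) {
    return head;
}
```

The user never sees the header: `toy_malloc(10)` returns the address just past it, and `toy_header` recovers it by stepping back one `struct meta`.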

GK Yedhu

@gkdev10

27 Nov 2025

Now we can implement free()! To free(p): metadata = p - sizeof(meta); metadata->free = 1; The block is now reusable. To reuse freed space, malloc first traverses the list looking for a free block that is large enough before allocating new space.
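A minimal C sketch of those two steps, assuming the header layout from this thread (size, free flag, next pointer). The names `toy_free` and `reuse_block` are hypothetical, the search is plain first-fit, and the double-free assert is an extra that the post does not mention.

```c
#include <assert.h>
#include <stddef.h>

struct meta {
    size_t size;        /* usable bytes in the block */
    int free;           /* 1 if the block can be reused */
    struct meta *next;  /* next block in the list */
};

/* The header sits immediately before the user pointer. */
struct meta *header_of(void *p) {
    return (struct meta *)p - 1;
}

/* free(p): step back to the header and flag the block as reusable. */
void toy_free(void *p) {
    if (p == NULL) return;      /* like free(NULL), do nothing */
    struct meta *m = header_of(p);
    assert(m->free == 0);       /* extra check: catches simple double frees */
    m->free = 1;
}

/* First-fit search: walk the list and claim the first free block that
   is large enough. Returns NULL if none fits, in which case the caller
   would request fresh memory instead. */
struct meta *reuse_block(struct meta *head, size_t size) {
    for (struct meta *m = head; m != NULL; m = m->next) {
        if (m->free && m->size >= size) {
            m->free = 0;
            return m;
        }
    }
    return NULL;
}
```

A malloc built on this would call `reuse_block` first and fall back to growing the heap only when it returns NULL.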

GK Yedhu

@gkdev10

14 Nov 2025

printf("Hello World!"); .----. | o_o | | :_/ | / / \ \ ( | | ) /'\_ _/`\ \___)=(___/