Darshan

Pinned Tweet

Darshan

@neuronfitting

Jan 22

it seems optimization in cs is just doing data transfer/manipulation on chunks of data

Darshan

@neuronfitting

process for generating knowledge is virtually indistinguishable from process for generating speech

Darshan retweeted

Naval

@naval

19h

Science is not a process, a credential, or an institution. It is the unflinching pursuit of truth, carried out by the few, co-opted by the many.

Darshan retweeted

Paras Chopra

@paraschopra

19h

I see a lot of enthusiasm about building sovereign models on my timeline. That's great to hear and India needs it, BUT.. building a Fable-class model is a compute and funding game. Last I checked, India had ~50-100k H100 equivalents while frontier labs would have a million each. Unless we have a paradigm shift in how AIs are trained, the conversation ought to be happening about the amount of funding available to do what we want to do. Show me an Indian company that's secured funding/compute in the same range as that of Chinese AI labs (let alone American labs). Without compute, what will happen is what has happened before: we'd promise to shake the world and then build models that are a year or two behind the top ones. The path forward for sovereign models that I see is to invest in basic R&D so we have a chance to go beyond the current paradigm, OR the government pooling in several orders of magnitude more compute to seriously commit to competing at par.

Darshan

@neuronfitting

Jun 13

wtf?

Anthropic

@AnthropicAI

Jun 13

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

Darshan

@neuronfitting

Jun 9

To feel the burning itch of curiosity requires both that you be ignorant, and that you desire to relinquish your ignorance.

Darshan

@neuronfitting

Jun 8

opposite of happiness isn't sadness, it's boredom.

Darshan

@neuronfitting

Jun 8

People should get smarter at a rate sufficient to integrate their old experiences, but not so much smarter so fast that they can't integrate their new intelligence. Being smarter means you get bored faster, but you can also tackle new challenges you couldn't understand before.

Darshan

@neuronfitting

Jun 5

"What I cannot create, I do not understand." Introducing: The Feynman GPU Lectures. Your H100s and B200s are running at a fraction of their peak utilization because your custom kernels are written with massive hardware bottlenecks. If you don't know what tcgen05.mma does at the wire level, you're lighting compute efficiency on fire.

Darshan

@neuronfitting

Jun 5

Register files used to be the ultimate bottleneck for Tensor Core accumulators. Introducing Blackwell’s Tensor Memory (TMEM), a completely new address space inside the SM that isolates the accumulator entirely from the register file.

Darshan

@neuronfitting

Jun 5

Introducing your new modern GPU blueprint. Read the full post here: dcbaslani.xyz/blog/gpu_maste…

The Feynman GPU Lectures

A GPU masterclass that builds from transistors and CUDA cores up through SM architecture, memory systems, Tensor Cores, Hopper, and Blackwell.

dcbaslani.xyz

Darshan

@neuronfitting

Jun 1

Open source is catching up to frontier labs!

MiniMax (official)

@MiniMax_AI

Jun 1

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities
- Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas
- MiniMax Sparse Attention scales context to 1M
- Natively Multimodal from Step Zero
API: platform.minimax.io
Token Plan: platform.minimax.io/subscrib…
🚀New! MiniMax Code: code.minimax.io
Weights & Tech Report in ~10 Days

Darshan

@neuronfitting

May 28

This is crazy work!!!!

Elon Musk

@elonmusk

May 28

SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible. The potential speed improvement vs JAX for large training runs is over an order of magnitude.

Darshan

@neuronfitting

May 25

Just managed to 5x the inference throughput of Qwen 3.5 on a B200 (from ~16 tok/s to ~83 tok/s) by ripping out PyTorch overhead and building custom fused kernels. I wrote a full deep dive on how to move from memory-bound PyTorch to compute-bound

Darshan

@neuronfitting

May 25

Looking at the initial profiler traces, the compute cores were starving. Standard PyTorch separates residual hidden_states and RMSNorm, forcing the GPU to do multiple round trips to HBM just to execute basic math.
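
A rough way to see why the separated ops starve the compute cores is to count elements moved to and from HBM. The sketch below is a back-of-the-envelope model, not a measurement: it assumes every eager elementwise op runs as its own kernel, that the fused version reads its inputs once, and that the norm weight and per-row statistics are small enough to ignore. The function name `hbm_traffic_elements` is hypothetical.

```python
def hbm_traffic_elements(n, fused):
    """Approximate elements moved to/from HBM for residual add + RMSNorm.

    n is the number of activation elements (tokens * hidden size).
    Assumption: each unfused op is a separate kernel with its own
    read and write of the full activation tensor.
    """
    if fused:
        # read x, read residual, write updated residual, write normed output
        return 4 * n

    traffic = 0
    traffic += 3 * n  # h = x + residual: read 2n, write n
    traffic += 2 * n  # h * h: read n, write n
    traffic += 1 * n  # mean over the hidden dim: read n, tiny per-row write
    traffic += 2 * n  # h * rsqrt(mean + eps): read n, write n
    traffic += 2 * n  # scale by the norm weight: read n, write n
    return traffic


# Example: compare the two schedules for the same activation size.
unfused = hbm_traffic_elements(4096, fused=False)
fused = hbm_traffic_elements(4096, fused=True)
ratio = unfused / fused
```

Under these assumptions the unfused schedule moves 2.5x as many elements, and because the math per element is trivial, the kernel time is dominated by that traffic.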

Darshan

@neuronfitting

May 25

Phase 1: I built a custom Triton kernel for Qwen's Zero-Centered RMSNorm and re-architected the HF model to pass the residual stream continuously across layers (vLLM style). That eliminated the memory boundary and netted a 5.7x latency reduction on the norm.
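
For reference, the math such a fused kernel computes can be sketched in plain Python for a single token. This is not the Triton kernel from the post. It assumes "zero-centered" means the learned weight is stored as an offset from 1, so the effective scale is `1 + weight`, and the name `fused_add_zero_centered_rmsnorm` and the `eps` default are illustrative only.

```python
import math


def fused_add_zero_centered_rmsnorm(x, residual, weight, eps=1e-6):
    """Residual add and RMSNorm in one pass over a token's hidden vector.

    Returns (normed, new_residual). The updated residual is returned so the
    caller can pass it to the next layer instead of re-adding it later.
    Assumption: zero-centered weights, i.e. the scale is (1 + weight).
    """
    # Step 1: update the residual stream.
    new_residual = [a + b for a, b in zip(x, residual)]

    # Step 2: root-mean-square statistic over the hidden dimension.
    mean_sq = sum(v * v for v in new_residual) / len(new_residual)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)

    # Step 3: normalize and apply the zero-centered scale.
    normed = [v * inv_rms * (1.0 + w) for v, w in zip(new_residual, weight)]
    return normed, new_residual


normed, new_residual = fused_add_zero_centered_rmsnorm(
    [1.0, 2.0], [2.0, 2.0], [0.0, 0.0], eps=0.0
)
```

With all-zero weights the output has unit mean square, which makes a convenient correctness check when comparing a GPU kernel against this reference.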

Darshan

@neuronfitting

May 21

crazy work! wonderful writeup!

steve

@gpusteve

May 20

we recently optimized qwen3.5-397b-a17b to be the fastest deployment publicly hosted. and the crazy thing: we did it by writing CUSTOM KERNELS for AMD MI355x. 🍿 see our post below outlining how we optimized kernels to achieve SOTA performance.

Darshan

@neuronfitting

May 19

Multiverse of Madness

Andrej Karpathy

@karpathy

May 19

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

Darshan

@neuronfitting

May 18

i still don't understand why they use inverted X-axis!!!

Cursor

@cursor_ai

May 18

Replying to @cursor_ai

Composer 2.5 is exceptionally intelligent and up to 10x more efficient than similarly capable models.