Swayam Singh

Tweets

Pinned Tweet

Swayam Singh

@swayaminsync

19 Jun 2025

Strong version of you is dealing with all the inner demons silently, keeping all the chaos contained within you, hidden from the outside world. It'll get exhaustive sometimes and I am proud of you. Don't give up.

15,809

Swayam Singh retweeted

Swayam Singh

@swayaminsync

Jun 13

The following question was asked in the guest lecture of Stanford's CS336 with @realDanFu.

Question: How does a megakernel work when you have multi-GPU communication in the loop?

I did a similar thing in our recent work: each rank allocates its symmetric-memory buffer (i.e. each rank's buffer is mapped into a shared address space over NVLink, and there's a multicast pointer that the NVLink switch fabric fans out to all peers' copies).

Two PTX primitives do the actual reduction in hardware:

- multimem.red => a rank writes its partial to all peers at once (fabric-side add into the multicast address).
- multimem.ld_reduce.add => a rank reads from the multicast pointer and the fabric returns the sum across all ranks' copies in a single instruction.

The barriers can be split: the signal is hoisted early so the cross-rank wait overlaps with independent compute scheduled in the gap.
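To make the two primitives concrete, here is a toy host-side model in plain Python. It is only a sketch of the semantics described in the tweet, not PTX and not the author's kernel: the class `MulticastBuffer` and the helper names are hypothetical, and a Python list per rank stands in for each rank's copy of the symmetric buffer.

```python
class MulticastBuffer:
    """Toy model: one copy of the buffer per rank, all reachable via one 'multicast' handle."""

    def __init__(self, world_size, numel):
        self.copies = [[0.0] * numel for _ in range(world_size)]

    def red_add(self, index, value):
        # Models the multimem.red behaviour from the tweet:
        # a single write is added into every rank's copy.
        for copy in self.copies:
            copy[index] += value

    def ld_reduce_add(self, index):
        # Models the multimem.ld_reduce (add) behaviour from the tweet:
        # a single read returns the sum over all ranks' copies.
        return sum(copy[index] for copy in self.copies)


def allreduce_via_red(partials):
    """Push style: every rank adds its partial into all copies."""
    buf = MulticastBuffer(len(partials), len(partials[0]))
    for partial in partials:
        for i, value in enumerate(partial):
            buf.red_add(i, value)
    return buf.copies  # every copy now holds the full sum


def allreduce_via_ld_reduce(partials):
    """Pull style: every rank stores locally, then reads the reduced value."""
    buf = MulticastBuffer(len(partials), len(partials[0]))
    for rank, partial in enumerate(partials):
        buf.copies[rank][:] = partial  # plain local store, no communication
    return [buf.ld_reduce_add(i) for i in range(len(partials[0]))]
```

Both helpers produce the same sums; the difference is whether the fan-out happens on the write (`red_add`) or the reduction happens on the read (`ld_reduce_add`). The split barrier mentioned above is not modelled here.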

3,188

Swayam Singh

@swayaminsync

Jun 13

I didn't know in particular that NCCL calls can be fused into a megakernel (it would be really cool to know how 😃). In my vanilla use, I found multimem 4x better than NCCL's floor (obviously at B=1).

337

Swayam Singh

@swayaminsync

Jun 13

Docs: docs.nvidia.com/cuda/paralle…

268

Swayam Singh retweeted

Swayam Singh

@swayaminsync

Jun 12

This is exactly why I'm building Cpp-Verify, and it's exciting to see @JaneStreetGroup embracing a similar philosophy around OCaml and formal methods. Verification tooling shouldn't live outside the language ecosystem, it should evolve alongside it. Cpp-Verify takes this approach by extending the C compiler stack (Clang/LLVM) itself, adding contracts and a dedicated verification IR that can target multiple backends. Today we're focused on Z3, with BMC and Lean serving as experimental backends. As AI makes code generation cheaper than ever, verification is rapidly becoming the bottleneck. The future belongs to language-integrated verification, not bolt-on tooling.

Yaron (Ron) Minsky

@yminsky

Jun 11

Our goals here are ambitious! Our hope is to make formal methods as pervasively useful of a tool for building software as sophisticated type systems are for us today. blog.janestreet.com/formal-m…

1,292

Swayam Singh

@swayaminsync

Jun 12

Public release by the start of July!!

134

Swayam Singh

@swayaminsync

Jun 5

Begin: 149 ms/tok

504

Swayam Singh

@swayaminsync

Jun 11

12.45 ms/tok and I am stopping now, other things need more attention. Will prepare its OSS release soon

Swayam Singh

@swayaminsync

Jun 11

Okay, 11.5 ms/tok. That's the last one, won't do any more!!
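For scale, the per-token latencies quoted in these posts can be turned into throughput and speedup with a couple of lines of arithmetic. This sketch assumes the 149 ms/tok starting point and the 11.5 ms/tok result refer to the same kernel and workload, which the feed suggests but does not state; the function names are illustrative only.

```python
def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds into tokens per second."""
    return 1000.0 / ms_per_token


def speedup(before_ms, after_ms):
    """Ratio of the old latency to the new one (higher means faster)."""
    return before_ms / after_ms


if __name__ == "__main__":
    for label, ms in [("begin", 149.0), ("Jun 8", 13.75), ("Jun 11", 12.45), ("final", 11.5)]:
        print(f"{label}: {ms} ms/tok, {tokens_per_second(ms):.1f} tok/s, "
              f"{speedup(149.0, ms):.2f}x vs begin")
```

Under that assumption, going from 149 to 11.5 ms/tok is roughly a 13x reduction in per-token latency.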

Swayam Singh

@swayaminsync

Jun 10

Warp-4 decided to be different than the crowd and ended up spinning forever. Bao-Wao old friend!

310

Swayam Singh retweeted

Swayam Singh

@swayaminsync

Jun 9

Just recalled something from my work on NextCoder last year. We found that a moving base was sub-optimal compared to a fixed base. (This gives the interpretation of a constrained KL divergence between the base and the checkpoint, leading to benefits, kind of an artifact parallel to RL.) But inspired by DINO v1, what if we use a moving base, but instead of a big checkpoint jump we take an exponential moving average and do the SeleKT updates against that? I have neither the compute nor the time, so if anyone wants to take this, feel free to.
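The idea above could be prototyped along these lines. This is a minimal sketch under stated assumptions, not the NextCoder implementation: weights are flat Python lists, `selekt_project` is a simplified stand-in that keeps only the largest-magnitude deltas relative to the base and resets the rest, and `ema_update`, `decay` and `keep_fraction` are hypothetical names.

```python
def ema_update(base, checkpoint, decay):
    """Move the base toward the checkpoint by an exponential moving average.

    decay = 1.0 reproduces a fixed base; decay = 0.0 is a full checkpoint jump.
    """
    return [decay * b + (1.0 - decay) * c for b, c in zip(base, checkpoint)]


def selekt_project(base, weights, keep_fraction):
    """Assumed, simplified selective update: keep the top-k deltas by magnitude,
    reset every other parameter to its base value."""
    deltas = [w - b for w, b in zip(weights, base)]
    k = max(1, int(round(keep_fraction * len(deltas))))
    ranked = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)
    keep = set(ranked[:k])
    return [w if i in keep else b for i, (w, b) in enumerate(zip(weights, base))]


def periodic_step(base, weights, decay, keep_fraction):
    """One periodic step: project onto the current base, then let the base drift slowly."""
    projected = selekt_project(base, weights, keep_fraction)
    new_base = ema_update(base, projected, decay)
    return new_base, projected
```

With `decay` close to 1 the base stays near the original model, so the distance between base and checkpoint remains bounded, which is the property the fixed-base result seems to rely on. Whether this actually helps is exactly the open question in the post.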

328

Swayam Singh

@swayaminsync

Jun 9

I envy people who can use agents effectively. In all of my tasks, they just keep saying: "This is a huge write-up", "needs more than 3-4 weeks". So I used /goal, and after 2 iterations it told me "I can't do in this session, please do /goal clear" 😂😂

453

Swayam Singh retweeted

Swayam Singh

@swayaminsync

Jun 8

13.75 ms/tok (reached @vllm_project's latency with compilation enabled, block-scaled fp8). And my kernel still has 10 idle clusters! That is where the next gains will come from (no CLC, it would raise more design questions; will try split-k first).

362

Swayam Singh

@swayaminsync

Jun 8

Claude discovered a precision bug that had stayed in production for more than 2 years. It happened because our test cases did not include a huge power-of-2 number. I will soon be advocating to ditch test-case-based correctness checks and jump to formal guarantee methods instead.
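The actual bug is not described in the post, so the following is only a generic illustration of how a large power of two can expose a precision problem that small test inputs never hit. It uses ordinary Python floats (IEEE 754 doubles); the function names are made up for this sketch.

```python
def absorbs_increment(x):
    """True when adding 1.0 to x is lost to rounding, i.e. x + 1.0 == x."""
    return x + 1.0 == x


def first_absorbing_power_of_two(max_exponent=80):
    """Smallest exponent e such that 2.0**e silently swallows a +1.0, or None."""
    for exponent in range(max_exponent + 1):
        if absorbs_increment(2.0 ** exponent):
            return exponent
    return None


if __name__ == "__main__":
    # Typical test inputs behave as expected...
    print(absorbs_increment(1000.0))
    # ...while a large enough power of two does not.
    print(first_absorbing_power_of_two())
```

A test suite that only samples small or "ordinary" values will pass code that relies on `x + 1.0 != x`, which is the kind of gap a formal guarantee over the whole input range would close.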

419

Swayam Singh

@swayaminsync

Jun 8

I am really enjoying working in high-precision settings.

Swayam Singh

@swayaminsync

Jun 7

They added agentic skills to Z3. Much needed!! github.com/Z3Prover/z3

288