Hrishbh Dalal

Cade Daniel 🇺🇸 retweeted

Hrishbh Dalal

@HrishbhDalal

25 Mar 2025

What if we could teach LLMs to be algorithm inventors? I trained an LLM to improve sorting algorithms through pure reinforcement learning, and it discovered optimizations giving 47.92x speedups over an optimized Python-based Timsort baseline! No cold-start data needed. I used the @huggingface GRPO implementation and the @Alibaba_Qwen 7B model.

780

107,181

Jonathan Frankle

Cade Daniel 🇺🇸 retweeted

Jonathan Frankle

@jefrankle

25 Mar 2025

The hardest part about finetuning LLMs is that people generally don't have high-quality labeled data. Today, @databricks introduced TAO, a new finetuning method that only needs inputs, no labels necessary. Best of all, it actually beats supervised finetuning on labeled data.

134

890

90,754

Simran Arora

Cade Daniel 🇺🇸 retweeted

Simran Arora

@simran_s_arora

25 Mar 2025

BASED ✌️ turns 1! One year since its launch at NeurIPS 2023, it has helped shape the new wave of efficient LMs.
⚡️ Fastest linear attention kernels
🧠 405B models trained on 16 GPUs
💥 Inspired Mamba-v2, RWKVs, MiniMax
Check out our retrospective below!

107

22,445

Hongyang Zhang

Cade Daniel 🇺🇸 retweeted

Hongyang Zhang @hongyangzh

21 Mar 2025

Jointly announcing EAGLE-3 with SGLang: setting a new record in LLM inference acceleration!

- 5x🚀 than vanilla (on HF)
- 1.4x🚀 than EAGLE-2 (on HF)
- A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang)
- 1.65x🚀 in latency even for large bs=64 (on SGLang)
- A new scaling law: more training data, better speedup
- Apache 2.0

Paper: arxiv.org/abs/2503.01840
Code: github.com/SafeAILab/EAGLE
SGLang version: github.com/sgl-project/sglan…

⚒️ Takeaway: Introducing training-time test, a novel draft model training technique: we replace feature prediction with direct token prediction and shift from top-layer-only features to multi-layer feature fusion. This approach unlocks a new scaling law previously undiscovered in EAGLE and EAGLE-2.

🙏 Acknowledgements: We would like to thank the SGLang team (@zhyncs42 @lm_zheng @ying11231 @JamesLiuID, @ispobaoke, and others @lmsysorg) for merging and carefully evaluating EAGLE-3 on SGLang.

🤝 Want to collaborate? We're a small academic group with limited GPU resources. If you're interested in supporting our next version of EAGLE or would like us to train a preliminary version tailored to a specific model, please get in touch!

Joint work with Yuhui Li, Fangyun Wei, and Chao Zhang.

299

41,769

Shanli Xing

Cade Daniel 🇺🇸 retweeted

Shanli Xing @shanli_xing

11 Mar 2025

🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…

180

31,388

Simon Guo

Cade Daniel 🇺🇸 retweeted

Simon Guo

@simonguozirui

25 Feb 2025

LLMs for GPU kernel🌽generation have been getting Pop🍿ular since our preview last Dec; excited to announce 📢 our full paper 📃 for KernelBench! Turns out KernelBench is quite challenging 🧠 — frontier models outperform the PyTorch Eager baseline <20% of the time. More 🧵👇

305

114,031

Cade Daniel 🇺🇸

Cade Daniel 🇺🇸

@cdnamz

11 Feb 2025

Welcome @istoica05

Hao Zhang

@haozhangml

11 Feb 2025

Thrilled to see @istoica05 joining X and couldn't agree more with his insights on the importance of shared infrastructure. "Open source" encompasses more than just open weights—it includes open data, open artifacts, and open infrastructure!

896

Cade Daniel 🇺🇸

Cade Daniel 🇺🇸

@cdnamz

24 Jan 2025

Congrats!

Deli Chen

@victor207755822

24 Jan 2025

Unbelievable results, feels like a dream—our R1 model is now #1 in the world (with style control)! 🌍🏆 Beyond words right now. 🤯 All I know is we keep pushing forward to make open-source AGI a reality for everyone. 🚀✨ #OpenSource #AI #AGI #DeepSeekR1

685

Grad

Cade Daniel 🇺🇸 retweeted

Grad

@Grad62304977

20 Jan 2025

People waking up to take their bitter lesson pill x.com/rm_rafailov/status/188…

Rafael Rafailov @ NeurIPS

@rm_rafailov

20 Jan 2025

DeepSeek R1 with "Cold Start" pretty much works as expected. I still don't buy the R1 Zero result, the base models barely output coherent solutions without finagling. My bet is there is some correction/reflection/backtracking-like data in mid-training.

8,631

Cade Daniel 🇺🇸

Cade Daniel 🇺🇸

@cdnamz

27 Nov 2024

love finding bangers so damn good they force a follow

779

Suhail

Cade Daniel 🇺🇸 retweeted

Suhail

@Suhail

18 Nov 2024

Once the AI labs realize they need to make products for survival, they will immediately reformulate their strategy to competing with the most obvious working thing that is vaguely under the guise of the original mission. You should presume you will be ruthlessly copied.

768

106,412

Rohan Choudhury

Cade Daniel 🇺🇸 retweeted

Rohan Choudhury

@rchoudhury997

15 Nov 2024

Excited to finally release our NeurIPS 2024 (spotlight) paper! We introduce Run-Length Tokenization (RLT), a simple way to significantly speed up your vision transformer on video with no loss in performance!

169

1,427

155,993

Vima Gupta

Cade Daniel 🇺🇸 retweeted

Vima Gupta @vima_gupta

15 Nov 2024

1/7 🧵 MoEs: A tale of expectation vs reality

Marketing: "Only compute the expert parameters you need!"
Reality: Batch 16 requests → ALL experts activate

At serving time (vLLM/TGI), arithmetic intensity:
AI ≈ (num_tokens * top_k) / total_experts

In simpler terms: your decode arithmetic intensity scales inversely with expert count 🤔

#MoE #LLMs #ChatGPT #Claude #vllm #AI #ML

3,150
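The arithmetic-intensity estimate in the MoE thread above can be made concrete with a small sketch. The first function is the thread's formula as written. The second is an added assumption, not something the thread states: it supposes every token picks its `top_k` experts uniformly at random and independently of other tokens, which gives a rough feel for why a modest batch touches nearly every expert. Function names and the example sizes (16 tokens, top-2 routing, 8 or 64 experts) are illustrative only.

```python
def decode_arithmetic_intensity(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Estimate from the thread: AI ~ (num_tokens * top_k) / total_experts.

    Read it as the average number of tokens each expert processes per
    decode step, i.e. how much work amortizes one load of expert weights.
    """
    return num_tokens * top_k / total_experts


def expected_active_fraction(num_tokens: int, top_k: int, total_experts: int) -> float:
    """Expected fraction of experts hit by at least one token.

    ASSUMPTION (not from the thread): each token selects top_k distinct
    experts uniformly at random, independently of the other tokens.
    Real routers are learned and usually far from uniform.
    """
    p_miss_one_token = 1.0 - top_k / total_experts
    return 1.0 - p_miss_one_token ** num_tokens


if __name__ == "__main__":
    for experts in (8, 64):
        ai = decode_arithmetic_intensity(16, 2, experts)
        frac = expected_active_fraction(16, 2, experts)
        print(f"experts={experts:3d}  AI~{ai:.2f}  active fraction~{frac:.3f}")
```

Holding the batch and `top_k` fixed, growing the expert count lowers the estimate proportionally, which is the inverse scaling the thread points out.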

𝚐𝔪𝟾𝚡𝚡𝟾

Cade Daniel 🇺🇸 retweeted

𝚐𝔪𝟾𝚡𝚡𝟾 @gm8xx8

15 Nov 2024

Pie: Pooling CPU Memory for LLM Inference paper: arxiv.org/abs/2411.09317 Pie is an LLM inference framework that tackles the memory challenges of large models by enabling efficient GPU-CPU memory swapping and adaptive expansion. It optimizes memory usage without increasing latency, achieving up to 1.9x higher throughput and 2x lower latency compared to alternatives like vLLM, while reducing GPU memory usage by up to 1.67x.

169

11,014

Michael Matthews

Cade Daniel 🇺🇸 retweeted

Michael Matthews @mitrma

11 Nov 2024

🍎 The core of Kinetix is our new 2D rigid body physics engine: Jax2D. This is a minimal rewrite of the classic Box2D engine made by @erin_catto. Jax2D allows us to run thousands of heterogeneous parallel environments on a single GPU (yes, you can vmap over different tasks!) 8/

3,417

Jerry Tworek

Cade Daniel 🇺🇸 retweeted

Jerry Tworek

@MillionInt

2 Nov 2024

ARR is the only meaningful AGI metric

11,646

vLLM

Cade Daniel 🇺🇸 retweeted

vLLM

@vllm_project

22 Oct 2024

Speculative decoding is one of the best tools in vLLM's inference optimization toolbox, accelerating inference without accuracy loss. Check out our blog post for more details about the state of spec decode in vLLM today! 🧵 blog.vllm.ai/2024/10/17/spec…

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

235

30,683

Andreas Köpf

Cade Daniel 🇺🇸 retweeted

Andreas Köpf

@neurosp1ke

17 Oct 2024

If you are interested in the latest GPU MODE news (upcoming lectures, videos etc.) please follow our new official twitter/x account: @GPU_MODE

2,126

Simran Arora

Cade Daniel 🇺🇸 retweeted

Simran Arora

@simran_s_arora

14 Oct 2024

Want Llama 405B, but wish it scaled linearly in sequence length??? Enter LoLCATS: an efficient method for "turning Transformers to linear attention models", all on an academic budget!! We use LoLCATS to linearize the *full Llama 3.1 model family* for the first time – 20 points of improvement on 5-shot MMLU over prior methods with only 0.2% of past methods’ model parameters and 0.4% of their training tokens! Had sooo much fun working with @mzhangio and @HazyResearch on this!

643

100,272

Arthur Douillard

Cade Daniel 🇺🇸 retweeted

Arthur Douillard

@Ar_Douillard

14 Oct 2024

KV Prediction for Improved Time to First Token

LLM inference can be split into two phases: prefilling and decoding. The decoding phase is autoregressive: tokens are generated one by one, re-using previous Key/Value tensors in the KV-cache. To speed up that process, we can use speculative decoding (arxiv.org/abs/2302.01318), where a small draft model quickly samples many tokens and a bigger scorer model checks from time to time, in a single forward pass, whether the draft looks OK.

KV Prediction for Improved Time to First Token (arxiv.org/abs/2410.08391) proposes a similar thing, but for prefilling: a smaller model predicts the KV-cache, per layer, of the big model. This is quite useful with extremely long contexts, where the initial prefill contains tons of PDFs or videos.

The problem is that while in speculative decoding the small and big models share the same token logits space, in KV-cache prefilling the dimensions of the tensors differ between the two models, so it wouldn't work out of the box. The paper proposes to train a linear projection per layer to go from the small model's space to the larger model's space. FLOPs are reduced with less performance degradation (see KVP-C and KVP-LP).

The prompt length seems quite small, however, so it'd be worth investigating how well it scales on the needle-in-a-haystack benchmark from Anthropic (anthropic.com/news/claude-3-…, ctrl-F "recall").

205

18,394
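The draft-then-verify loop described in the thread above can be sketched in a few lines. This is a simplified greedy variant under stated assumptions, not the sampling-based acceptance rule of the cited paper and not any library's API: both "models" are hypothetical toy functions mapping a token context to one next token, and the verifier loops over positions for clarity where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: context (list of token ids) -> greedy next token id.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify round; returns the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target model checks the proposal position by position.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


# Toy models (illustrative only): the target counts upward modulo 10,
# and the draft agrees except right after token 2, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


def plain_greedy(model: NextToken, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out
```

In this greedy setting the output is identical to decoding with the target alone, since every emitted token is one the target would have chosen; the draft only changes how many target calls can be batched per round.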