Roger Wang

Kyle Kranen retweeted

Roger Wang

@rogerw0108

Jun 4

🎉 It's finally here! You can also deploy the model disaggregated with Dynamo - check it out! github.com/ai-dynamo/dynamo/…

vLLM

@vllm_project

Jun 4

🚀 Day-0 support for NVIDIA Nemotron 3 Ultra on vLLM! Ready to be served with the latest vLLM stable release, the new open frontier reasoning model is built for long-running autonomous agents:

🧠 550B total / 55B active — Hybrid Transformer-Mamba MoE
📚 Up to 1M token context
⚡ NVFP4 BF16
🛠️ Tool calling, coding, deep research, orchestration

Read our detailed model launch blog and recipes! recipes.vllm.ai/nvidia/NVIDI…

1,157

Kyle Kranen

@KranenKyle

Jun 3

If one had 4 inoperative V100 SXM2 GPUs, what would one do with them? Asking for a friend…

754

Kyle Kranen retweeted

Hao Kang

@GT_HaoKang

Jun 1

Excited to share that ThunderAgent has been integrated into NVIDIA Dynamo as an experimental router for agentic workloads! ThunderAgent was designed to schedule at the granularity of agent runs, making agentic serving/RL up to 4x faster! Huge thanks to @0xishand, @KranenKyle, and the Dynamo team. They have been exceptionally efficient and proactive — the team had already started pushing this forward even before I officially joined @nvidia. Looking forward to seeing ThunderAgent ideas further evolve within Dynamo. And thanks for the help from @togethercompute. Link: github.com/ai-dynamo/dynamo/… @simran_s_arora @Chenfeng_X @_weilix @yinfang_chen #AI #MLsys #Agent #Nvidia

15,723

Kyle Kranen

@KranenKyle

May 31

Alt use: use this to prototype new algorithms for scheduling in Dynamo and tag us! We’ve used this successfully to design some improvements with auto research and are just starting to push into how far this can go!

ishan

@0xishand

May 31

Replying to @0xishand

If folks are trying to break into inference -> see if you can figure out where we differ from engines and fix it. Send me a PR showing our perf before and after compared to the engine.

2,242

Kyle Kranen retweeted

ishan

@0xishand

May 31

If folks are trying to break into inference -> see if you can figure out where we differ from engines and fix it. Send me a PR showing our perf before and after compared to the engine.

2,880

Kyle Kranen

@KranenKyle

May 30

About a month ago I posted about ongoing work on datacenter scale inference simulation. People seemed to like it so we wrote more about it! Check out this awesome blog post from the Dynamo team!

NVIDIA AI

@NVIDIAAI

May 30

There's a better way to serve your inference stack, you just haven't found it yet. DynoSim is a workload-driven simulation of the Dynamo serving stack that turns exhaustive deployment search into a simulate-then-verify loop. Instead of testing every deployment choice, teams can model the whole stack on one virtual timeline, screen thousands of configurations in high fidelity simulation, then validate only the best candidates on real hardware. And because it's a full Rust implementation, it runs extremely fast. In our testing, 1,500x faster than real time.

2,213

Kyle Kranen retweeted

NVIDIA AI

@NVIDIAAI

May 30

371

57,494

Kyle Kranen

@KranenKyle

May 30

If you legitimately have hit RSI, is it better to minimize latency or maximize throughput for your inference? My mental model leans towards maximizing throughput, as A) parallel search is a thing and B) real world interactions tend to cause an Amdahl’s Law problem.
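
The Amdahl's Law point above can be made concrete with a small sketch. It assumes an agent loop in which some fraction of wall-clock time is inference (which lower latency can shrink) and the remainder is real-world interaction (which it cannot). The function name, the 50/50 split, and the speedup values are illustrative assumptions, not measurements.

```python
def amdahl_speedup(inference_fraction: float, inference_speedup: float) -> float:
    """Overall loop speedup when only the inference share of time gets faster."""
    if not 0.0 <= inference_fraction <= 1.0:
        raise ValueError("inference_fraction must be in [0, 1]")
    if inference_speedup <= 0.0:
        raise ValueError("inference_speedup must be positive")
    # The part that does not benefit from faster inference (tool calls, humans, I/O).
    serial_fraction = 1.0 - inference_fraction
    return 1.0 / (serial_fraction + inference_fraction / inference_speedup)


if __name__ == "__main__":
    # Hypothetical loop: half the time is inference, half is waiting on the world.
    for speedup in (2, 10, 100):
        print(speedup, round(amdahl_speedup(0.5, speedup), 3))
```

Under these assumptions the overall speedup can never exceed 1 / (1 - inference_fraction), which is 2x for the 50/50 split no matter how low inference latency goes. That ceiling is the argument for spending the hardware on throughput (more parallel searches) instead.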

2,185

Kyle Kranen retweeted

elie

@eliebakouch

May 28

"/goal rewrite jax in rust"

Elon Musk

@elonmusk

May 28

SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible. The potential speed improvement vs JAX for large training runs is over an order of magnitude.

518

29,172

Kyle Kranen retweeted

Schwinn

@szawinis

May 28

Super excited to share one of the projects I’ve been leading over the past few months. We can now get a single-GPU gpt-oss-120b vLLM instance up in under 5 seconds after container start. Next stop: multi-GPU checkpoint/restore!

NVIDIA AI

@NVIDIAAI

May 27

Introducing Dynamo Snapshot, our approach for fast startup for inference workloads on Kubernetes, which reduces startup time from minutes to under 5 seconds. In production inference deployments, demand fluctuates over time. Cold-starting inference workloads can take minutes, leaving idle GPUs that generate no tokens and serve no requests. Snapshot leverages GMS to enable concurrent weight restoration over a high-speed interconnect, while using Linux native AIO and parallel memfd restoration to accelerate CRIU restore performance.

3,792

Kyle Kranen

@KranenKyle

May 28

Cold starts are super painful for scaling LLM workers. Check out our work on restoring inference workers (including AOT traces) in seconds, not 10s of minutes!

NVIDIA AI

@NVIDIAAI

May 27

6,336

Kyle Kranen retweeted

ishan

@0xishand

May 27

Replying to @kylekuzma

@kylekuzma curious to get your thoughts on disaggregated serving and kv cache offloading to CPU/SSD?

1,051

Kyle Kranen

@KranenKyle

May 27

Ran into a familiar name :)

1,414

Kyle Kranen

@KranenKyle

May 25

A good memory from the Llama 3 days is that one of the best drafters for 405B was the 1B model because they were trained on the same distribution of data :)

Gabriele Berton

@gabriberton

May 24

In speculative decoding the drafter just needs to be as fast as possible and its predictions need to match the verifier's. I heard of drafters that are 2-layer 4B models, can anyone confirm?
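
A minimal sketch of the draft-then-verify step discussed above, assuming greedy decoding and toy next-token functions in place of real models. Everything here (`speculative_step`, `toy_target`, `toy_draft`) is hypothetical and only illustrates why agreement between drafter and verifier matters; a real engine verifies all drafted tokens in one batched forward pass and usually uses a probabilistic acceptance rule.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Return the tokens committed by one greedy draft-and-verify round."""
    # 1) The cheap drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2) The target checks each proposal (looped here for clarity).
    committed: List[int] = []
    ctx = list(prefix)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            committed.append(expected)  # first mismatch: keep the target's token and stop
            return committed
        committed.append(token)
        ctx.append(token)

    # 3) Every draft token matched, so the target's next token comes for free.
    committed.append(target(ctx))
    return committed


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy "verifier": counts upward modulo 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # disagrees after a 3


if __name__ == "__main__":
    print(speculative_step([1], toy_draft, toy_target, k=4))
    print(speculative_step([1], toy_target, toy_target, k=4))
```

The more often the drafter agrees with the target, the more tokens each round commits, which fits the observation that a small model trained on the same data distribution can make a strong drafter.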

8,924

Kyle Kranen

@KranenKyle

May 24

Ever wished you could adapt your EP size on the fly for fault tolerance or scaling purposes? Now you can with NIXL-EP! Check it out:

2,963

Kyle Kranen

@KranenKyle

May 24

vllm.ai/blog/2026-05-14-elas…

Elastic Expert Parallelism in vLLM

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maxi

vllm.ai

291

Kyle Kranen

@KranenKyle

May 24

I’m really excited by the number of non-bipedal/quadruped home robotics projects these days. The humanoid/animal form factors are well-studied, but leave so much on the table! Give me a refrigerator-sized food replicator or self-loading dishwasher!

1,035

Kyle Kranen

@KranenKyle

May 19

No matter the amount of full attention sparsity, decreased attention dim, or MLA/GQA, infinitely long context will always have to deal with quadratic prefill cost. Kind of begs the question of whether we can successfully train a model with local prefill and global decode?
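
A back-of-the-envelope sketch of the scaling argument above. It counts only the two attention matmuls (QK^T and attention-times-V) for one head in one layer, and ignores the causal-mask halving, softmax, and projections. The function names, the constant factor, and the window size are simplifying assumptions for illustration.

```python
def full_attention_flops(seq_len: int, head_dim: int) -> int:
    """Rough FLOPs for full attention over a whole prompt: grows with seq_len squared."""
    return 4 * seq_len * seq_len * head_dim


def local_attention_flops(seq_len: int, head_dim: int, window: int) -> int:
    """Same estimate when each token attends to at most `window` tokens: linear in seq_len."""
    return 4 * seq_len * min(window, seq_len) * head_dim


def decode_step_flops(context_len: int, head_dim: int) -> int:
    """Rough FLOPs for one new token attending globally over the existing context."""
    return 4 * context_len * head_dim


if __name__ == "__main__":
    head_dim, window = 128, 4096  # hypothetical values
    for n in (8_192, 65_536, 1_048_576):
        full = full_attention_flops(n, head_dim)
        local = local_attention_flops(n, head_dim, window)
        print(f"n={n}: full/local prefill ratio = {full / local:.0f}, "
              f"one global decode step = {decode_step_flops(n, head_dim)} FLOPs")
```

Under this estimate, doubling the prompt length quadruples full-attention prefill cost but only doubles windowed prefill cost, while a single global decode step stays linear in context length. That asymmetry is what makes "local prefill, global decode" an interesting question.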

3,782

Kyle Kranen

@KranenKyle

May 17

High fidelity performance simulation allows you to do some really cool stuff, especially auto research! Sim is a first-class citizen in Dynamo! On a MacBook, you can run simulations of 1000 GPUs against a real trace in virtual time, 1000x faster than real time. Check it out:

2,724

Kyle Kranen

@KranenKyle

May 17

docs.nvidia.com/dynamo/user-…

334