キヌア retweeted

t.toda

@Trtd6Trtd

Jun 9

mimo.xiaomi.com/blog/mimo-ti… Tried out Xiaomi's much-talked-about 1000 tok/s to see just how fast it runs on the familiar tokenspeed

17,157

Navya Nizamkari retweeted

PyTorch

@PyTorch

May 27

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 bit.ly/4uGUvIS This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

288

277,237

Jino Rohit

@jino_rohit

Jun 2

Replying to @junupark_

yeah, did you get a chance to see tokenspeed? they look great as well github.com/lightseekorg/toke…

GitHub - lightseekorg/tokenspeed: TokenSpeed is a speed-of-light LLM inference engine.

TokenSpeed is a speed-of-light LLM inference engine. - lightseekorg/tokenspeed

github.com

zhyncs

@zhyncs42

May 30

One of my favorite pieces of work in @lightseekorg TokenSpeed is TokenSpeed Kernel. Our bet is simple: CuteDSL + Triton Gluon. CuteDSL is backed by NVIDIA. Triton Gluon is backed by OpenAI. The team has deep expertise in both ecosystems. And we’re fortunate to have @LeiLMx, the designer behind that and a core maintainer of OpenAI Triton. Building great kernels is hard. Building a great kernel ecosystem is even harder. Excited to see the community coming together to push toward a world-class open kernel stack for AI.

126

9,483

Cristian Córdova 🐧

@barckcode

May 30

Hey, what kind of sorcery is the TokenSpeed engine from @lightseekorg 🤯 What an insane engine, I don't know how I didn't see it before. Last night, before going to sleep, I left an experiment running on a single GPU: one Qwen served with vLLM's default engine and another served with TokenSpeed. The experiment was nothing more than a loop firing batched requests of different token sizes, with some batches of 200K tokens. This morning I looked at how each one performed, and the TokenSpeed one ran 6x faster 😬 Without doing anything else, just swapping the engine (and, well, installing a couple of packages, nothing unusual). Outrageous. I still have to test properly that it doesn't break anything, check compatibility, etc., to see whether a migration is feasible, but if everything checks out, a tremendous upgrade is coming to NaN and Helmcode. For NaN, especially at peak hours, this is going to be a huge help. The downside is that today I have to record videos and won't be able to continue with this, but tomorrow I'll try to dig into all of it and see what comes out.

5,366

KVCache.AI

@KVCache_AI

May 28

Proud to collaborate with @Alibaba_Qwen, @lightseekorg, @NVIDIAAI, @PyTorch, and @tri_dao on this milestone 🚀 Together, we helped push Qwen3.5 on the TokenSpeed inference engine to a record-breaking 580 tokens/sec for agentic workloads on NVIDIA GPUs. From KV cache systems and runtime infrastructure to kernels, scheduling, and benchmarking, this was a true cross-stack co-design effort for high-performance open-source LLM inference. Full PyTorch blog 👇 pytorch.org/blog/up-to-580tp…

PyTorch

@PyTorch

May 27

1,473

thehype.

@thehypedotnews

May 27

x.com/i/article/205974895513…

1,285

Qubrid AI

@Qubrid_AI

May 27

Replying to @PyTorch

Impressive work from the teams behind @q, TokenSpeed, @NVIDIAAI, Mooncake & others. 🚀 580 tps for agentic workloads is a strong demonstration of how model architecture and inference optimization must evolve together to unlock real-world AI performance. At @Qubrid_AI, we're seeing growing enterprise demand for high-throughput, low-latency deployments of open-source models, and advancements like these help accelerate production-ready AI adoption. Looking forward to seeing what's next for the open AI ecosystem.

1,661

Neo

@NeoAIForecast

May 27

PyTorch just dropped a wild inference speed record. Up to 580 tokens per second on the massive Qwen3.5-397B model, running on NVIDIA Blackwell GPUs with the open-source TokenSpeed engine. pytorch.org/blog/up-to-580tp…

113

zhyncs

@zhyncs42

May 27

No office hours. No meetup. Just 3 weeks after launch, TokenSpeed already got support and adoption from Qwen and the PyTorch ecosystem. We can just build things. 🚀

Qwen

@Alibaba_Qwen

May 27

Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 pytorch.org/blog/up-to-580tp… #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

134

95,841

Minmin Sun @MinminSun2019

May 27

Big congrats to the TokenSpeed team & Qwen Inference team! 🙌 This is just chapter one. We’ll keep co-engineering to unlock speed-of-light inference for every Qwen model.

Qwen

@Alibaba_Qwen

May 27

1,539

LightSeek Foundation

@lightseekorg

May 27

Really happy to work together on pushing TokenSpeed to 580 TPS for agentic workloads on Qwen3.5 397B A17B. Open collaboration across the ecosystem keeps moving inference forward🚀

Qwen

@Alibaba_Qwen

May 27

93,525

Qwen

@Alibaba_Qwen

May 27

PyTorch

@PyTorch

May 27

1,113

591,308

t.toda

@Trtd6Trtd

May 25

Reproduced it on tokenspeed, and it really is blazing fast

388

CV.YH

@0xCVYH

May 24

The interesting Qwen number today isn't a leaderboard score. It's TokenSpeed showing ~540 tokens/s on an agentic workload, with ~63k input tokens and ~6.7 turns per request. Analogy: it's not top speed on a straightaway. It's fast delivery in traffic with several stops.

289