キヌア retweeted
mimo.xiaomi.com/blog/mimo-ti… I tried out how fast the much-discussed Xiaomi 1000 tok/s is on the familiar tokenspeed.
1
6
75
17,157
Navya Nizamkari retweeted
May 27
The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 bit.ly/4uGUvIS This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI , and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI
12
48
288
277,237
May 30
One of my favorite pieces of work in @lightseekorg TokenSpeed is TokenSpeed Kernel. Our bet is simple: CuteDSL and Triton Gluon. CuteDSL is backed by NVIDIA. Triton Gluon is backed by OpenAI. The team has deep expertise in both ecosystems. And we’re fortunate to have @LeiLMx, the designer behind that and a core maintainer of OpenAI Triton. Building great kernels is hard. Building a great kernel ecosystem is even harder. Excited to see the community coming together to push toward a world-class open kernel stack for AI.
6
8
126
9,483
Hey, what kind of sorcery is the TokenSpeed engine from @lightseekorg 🤯 What a crazy engine, I don't know how I didn't see it before. Last night, before going to bed, I left an experiment running: on the same GPU, one Qwen served by the default vLLM engine and another served by TokenSpeed. The experiment was nothing more than a loop sending batched requests of different token sizes. Some batches were 200K tokens. This morning I looked at the performance of each, and the TokenSpeed one ran 6x faster 😬 Without doing anything else, just changing the engine (and, well, installing a couple of packages, nothing unusual). Incredible. I still need to test properly that it doesn't break anything, check compatibility, etc., to see whether a migration is feasible, but if everything checks out, a tremendous upgrade is coming to NaN and Helmcode. For NaN, especially at peak hours, this is going to be a huge help. The downside is that today I have to record videos and can't keep working on this, but tomorrow I'll try to dig into all of it and see what comes out.


4
1
45
5,366
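A minimal sketch of the kind of side-by-side throughput comparison the post above describes, not the author's actual setup. Everything here is an assumption for illustration: the `Engine` callable interface, the injectable `clock`, and the `make_fake_engine` stand-in that simulates a backend at a fixed token rate. In a real run each engine would wrap requests to a served model instead.

```python
import time
from typing import Callable, Sequence

# Hypothetical interface: an "engine" is any callable that takes a batch of
# request sizes (in tokens) and returns how many tokens it processed.
Engine = Callable[[Sequence[int]], int]


def measure_throughput(engine: Engine,
                       batches: Sequence[Sequence[int]],
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Run every batch through the engine and return tokens per second."""
    total_tokens = 0
    start = clock()
    for batch in batches:
        total_tokens += engine(batch)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


class FakeClock:
    """Deterministic clock so the sketch can run without a real server."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


def make_fake_engine(tokens_per_second: float, clock: FakeClock) -> Engine:
    """Stand-in engine: advances the fake clock as if it ran at a fixed rate."""
    def engine(batch: Sequence[int]) -> int:
        tokens = sum(batch)
        clock.now += tokens / tokens_per_second
        return tokens
    return engine
```

To compare two real backends, the same `batches` list would be replayed against each one with the default `time.perf_counter` clock, and `speedup` applied to the two results. Running both on one GPU at the same time, as in the post, means the engines compete for resources, so separate runs may give a cleaner number.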
Proud to collaborate with @Alibaba_Qwen, @lightseekorg, @NVIDIAAI, @PyTorch, and @tri_dao on this milestone 🚀 Together, we helped push Qwen3.5 on the TokenSpeed inference engine to a record-breaking 580 tokens/sec for agentic workloads on NVIDIA GPUs. From KV cache systems and runtime infrastructure to kernels, scheduling, and benchmarking, this was a true cross-stack co-design effort for high-performance open-source LLM inference. Full PyTorch blog 👇 pytorch.org/blog/up-to-580tp…

1
4
14
1,473
Replying to @PyTorch
Impressive work from the teams behind @q, TokenSpeed, @NVIDIAAI, Mooncake & others. 🚀 580 tps for agentic workloads is a strong demonstration of how model architecture and inference optimization must evolve together to unlock real-world AI performance. At @Qubrid_AI, we're seeing growing enterprise demand for high-throughput, low-latency deployments of open-source models, and advancements like these help accelerate production-ready AI adoption. Looking forward to seeing what's next for the open AI ecosystem.
3
1,661
PyTorch just dropped a wild inference speed record: up to 580 tokens per second on the massive Qwen3.5-397B model, running on NVIDIA Blackwell GPUs with the open-source TokenSpeed engine. pytorch.org/blog/up-to-580tp…

4
113
May 27
No office hours. No meetup. Just 3 weeks after launch, TokenSpeed already got support and adoption from Qwen and the PyTorch ecosystem. We can just build things. 🚀
Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 pytorch.org/blog/up-to-580tp… #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance
2
5
134
95,841
Big congrats to the TokenSpeed team & Qwen Inference team! 🙌 This is just chapter one. We’ll keep co-engineering to unlock speed-of-light inference for every Qwen model.
2
10
1,539
Really happy to work together on pushing TokenSpeed to 580 TPS for agentic workloads on Qwen3.5 397B A17B. Open collaboration across the ecosystem keeps moving inference forward🚀
1
2
18
93,525
Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 pytorch.org/blog/up-to-580tp… #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

39
92
1,113
591,308
Reproduced it with tokenspeed. It really is blazing fast.
1
5
388
May 24
The interesting Qwen number today isn't a leaderboard result. It's TokenSpeed showing ~540 tokens/s on an agentic workload, with ~63k input tokens and ~6.7 turns per request. Analogy: it's not speed on a straight track. It's fast delivery in traffic with multiple stops.
3
289