Steffen Röcker

Pinned Tweet

Steffen Röcker

@sroecker

Apr 11

Your Hermes Agent can now delegate to RLMs 🙌 Recreated the document analyzer example with the converted skill. 136 PDF pages analyzed. Best part: Auto-configures from HERMES_MODEL / HERMES_PROVIDER env vars @NousResearch @Teknium github.com/sroecker/predict-…

Gabriel Lespérance

@GabLesperance

Apr 11

x.com/i/article/204291194557…

329

48,221

Steffen Röcker retweeted

techniacus

@Techniacus

Jun 10

How it feels to use Claude Fable

This tweet is unavailable

1,085

40,418

Steffen Röcker retweeted

Red Hat AI

@RedHat_AI

Jun 8

Most AI agents forget everything between conversations. Hermes Agent doesn't. It creates reusable skills from completed tasks, persists user memory across sessions, and runs a built-in cron scheduler for autonomous workflows. Deployed on OpenShift AI with @vllm_project for GPU inference, under 10 minutes with oc apply. Deployment manifests and UBI 9 Dockerfile: github.com/aicatalyst-team/h… developers.redhat.com/articl…

GitHub - aicatalyst-team/hermes-openshift: Deploy Hermes Agent on Red Hat OpenShift AI - Self-imp...

Deploy Hermes Agent on Red Hat OpenShift AI - Self-improving AI agent with vLLM GPU model serving - aicatalyst-team/hermes-openshift

github.com

10,050

Steffen Röcker retweeted

Poolside

@poolsideai

Jun 8

another banger from @pupposandro and the @luceboxai team
Luce Spark runs Laguna XS.2 in 14.6 GiB at ~100 tok/s on an RTX 3090, versus ~119 tok/s fully resident.
you can now run Laguna below the 16 GiB line and use it for local evals, agent traces, routing analysis, quantization, and serving experiments.

Sandro

@pupposandro

Jun 8

Excited to launch Luce Spark: now a 35B MoE runs on a 16GB GPU, with no offload tax.
An A3B model fires ~8 of its 256 experts per token, but to keep it resident you pay VRAM for all 256. Spark pins the experts your traffic actually hits, offloads the rest to CPU, and decodes the whole token in one fused graph, so offload stops costing speed.
▸ Qwen3.6 35B-A3B: ~20.5 → 13.3 GiB
▸ Laguna XS.2 33B-A3B: 18.8 → 14.6 GiB
Decode holds ~100 tok/s, close to the 119 you get with every expert resident on a 24 GB card.
No calibration step. It tunes itself from live traffic.

4,058
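The idea of pinning the experts that traffic actually hits can be illustrated with a toy example. This is a minimal sketch, not Luce Spark's implementation: the function names, the frequency-count policy, and the toy routing trace are all assumptions made for illustration. It counts how often each expert is selected and keeps the most frequent ones within a fixed budget.

```python
from collections import Counter


def choose_pinned_experts(routing_trace, budget):
    """Return the `budget` most frequently routed expert ids (hypothetical policy)."""
    counts = Counter(
        expert for token_experts in routing_trace for expert in token_experts
    )
    return {expert for expert, _ in counts.most_common(budget)}


def resident_hit_rate(routing_trace, pinned):
    """Fraction of expert activations that land on pinned (GPU-resident) experts."""
    total = sum(len(token_experts) for token_experts in routing_trace)
    hits = sum(
        1
        for token_experts in routing_trace
        for expert in token_experts
        if expert in pinned
    )
    return hits / total if total else 0.0


# Toy trace: each entry lists the experts one token was routed to.
trace = [[0, 1], [0, 2], [0, 1], [3, 1]]
pinned = choose_pinned_experts(trace, budget=2)
print(sorted(pinned), resident_hit_rate(trace, pinned))
```

Activations that miss the pinned set are the ones that would have to be served from the offloaded copy, so the hit rate is the quantity such a policy tries to keep high.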

Steffen Röcker retweeted

vLLM

@vllm_project

Jun 5

🧠 Gemma 4 QAT checkpoints are out, and vLLM is Google's recommended way to serve them! Open-source inference is at its best when one engine spans research and production — glad vLLM is the recommendation for Gemma 4 QAT. Get started 👇 huggingface.co/collections/g…

Gemma 4 QAT Q4_0 - a google Collection

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

Google Gemma

@googlegemma

Jun 5

We just dropped Gemma 4 Quantization-Aware Training (QAT) checkpoints on Hugging Face! All Gemma 4 model sizes and their drafters are now optimized with QAT to cut memory requirements and maximize on-device performance!

181

18,291

Steffen Röcker retweeted

Tomasz Tunguz

@ttunguz

Jun 2

Open-weight models have overtaken closed models on OpenRouter. 69.1% of token volume now goes to open-weight models. 30.9% to closed. Competition is a discovery procedure — and developers are discovering the value of open models. 🧵

156

31,675

Steffen Röcker retweeted

Red Hat AI

@RedHat_AI

May 29

Speculators v0.5.0 just dropped with 3 big updates:
- DFlash training support. Draft all tokens in one pass via block diffusion
- Unified online/offline training powered by @vllm_project's hidden states extraction system
- Docs & tutorials overhaul for faster onboarding
vllm.ai/blog/2026-05-28-spec…

Speculators v0.5.0: DFlash Support and Online Training

The v0.5.0 release brings significant architectural improvements to speculative decoding model training, introducing DFlash algorithm support, fully unified onl

vllm.ai

4,220

Steffen Röcker retweeted

Matt Hicks

@matthicksj

May 28

Project Lightwell is a $5 billion investment that marks a fundamental shift in how we think about our role as open source stewards. I believe it will define the next chapter of Red Hat's engineering mission. We are applying the same discipline, upstream-always commitment, and engineering rigor across all active application layers that modern enterprise environments depend on.

Red Hat

@RedHat

May 28

Introducing Project Lightwell from @IBM and Red Hat: a $5 billion, AI-powered, 20,000 engineer-strong, first-of-its-kind force to identify and fix open source vulnerabilities at scale. Read about our commitment to the future of open source in the AI era. red.ht/4nV9iwW

1,563

Steffen Röcker retweeted

spacy

@dosco

May 27

the trick is not to do native tool calling; instead, do code gen in an RLM-style REPL

Giac

@Giac_nicoli

May 27

The local-model crowd has been right that you can run serious models on a laptop. The catch nobody mentions: tool selection breaks there first. qwen3.5 on an M4 MacBook, 100 tools wired in, picks the right tool 8% of the time. Same model, same laptop, ranking gateway in front: 77%. Local OSS didn't need a bigger model to become viable for agents. It needed the catalog ranked before the model sees it. This one's for the people like @ivanfioravanti pushing local hard. cc @rstagi_

2,374
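Ranking the catalog before the model sees it can be sketched as a small pre-filter. The quoted tweet does not say how its gateway ranks tools, so everything here is an assumption: the function names, the tool catalog, and the simple word-overlap score are hypothetical stand-ins for whatever ranking is actually used.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_tools(query, tools, top_k=5):
    """Return the names of the top_k tools whose text overlaps most with the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in tools.items():
        overlap = len(query_tokens & tokenize(name + " " + description))
        scored.append((overlap, name))
    # Highest overlap first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical catalog; a real one would hold many more entries.
tools = {
    "get_weather": "Look up the current weather forecast for a city",
    "send_email": "Send an email message to a recipient",
    "search_files": "Search local files by name or content",
}
shortlist = rank_tools("what is the weather forecast in Berlin", tools, top_k=1)
print(shortlist)
```

Only the shortlisted tools would then be passed to the model, so it chooses among a handful of candidates instead of the full catalog.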

Steffen Röcker retweeted

Red Hat AI

@RedHat_AI

May 27

EAGLE 3.1 is out. The team identified attention drift as the root cause of acceptance-length degradation at deeper speculation steps. Fix: FC normalization post-norm hidden-state feedback. Result: 2x longer acceptance length in long-context workloads, 2.03x per-user throughput on Kimi K2.6. Already in @vllm_project nightly. Native support lands in the next release v0.22.0. Open source draft model available now.

vLLM

@vllm_project

May 26

🎉Thrilled to announce EAGLE 3.1 - the next evolution of speculative decoding from @EagleCorp, developed by @hongyangzh, @dogacel0, and the EAGLE team in collaboration with vLLM @vllm_project and TorchSpec @lightseekorg!
💡EAGLE 3.1 introduces a new FC normalization post-normalization hidden-state feedback architecture that significantly improves long-context robustness, acceptance length, and serving stability across real-world inference environments.
Shoutout to @NVIDIA who has been instrumental in the large-scale training, benchmarking, and inference validation of EAGLE 3.1 to help bring this next step in inference acceleration to production environments.
For EAGLE 3.1, the EAGLE team identified attention drift as a key bottleneck behind deeper-step acceptance-length degradation in speculative decoding.
✨What's new:
• Up to 2× longer acceptance length in long-context
• Stronger long-context chat-template robustness
• More stable serving across diverse prompts or environments
• Native vLLM support
• TorchSpec training support
• Open-source Kimi K2.6 EAGLE 3.1 draft model
🔗 Blog: vllm.ai/blog/2026-05-26-eagl…

4,806

Steffen Röcker retweeted

Arnav Chavan

@ArnavChavan6

May 19

🚀 Organizing the Efficient Qwen Competition @icmlconf !
Goal: Minimize LLM inference latency for a single GPU without breaking model quality.
Prizes: $3K / $2K / $1K present at ICML 2026, Seoul
Getting Started - adaptfm.gitlab.io/call-for-c…
Leaderboard - d1krc5fcnf73gi.cloudfront.ne…

144

10,736

Steffen Röcker retweeted

Julien Chaumond

@julien_c

May 20

What hardware actually powers open-source AI? Not benchmarks. Not vendor marketing. Real-world community usage. We’re launching @huggingface Hardware: → trending GPUs & CPUs → VRAM distribution → inference hardware trends → what the OSS AI ecosystem really runs on

415

80,954

Steffen Röcker retweeted

Dan Alistarh

@DAlistarh

May 19

Weight-only quantization powers local LLMs like llama.cpp or Ollama. But SOTA quantized accuracy requires complex kernels that are notoriously hard to implement. Can we get SOTA accuracy and keep things simple? Our new GSQ (Gumbel-Softmax Quantization) method says yes. 🧵

6,184

Steffen Röcker retweeted

Daniel Han

@danielhanchen

May 13

We released experimental MTP Qwen3.6 Unsloth GGUFs!
Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU.
Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy.
Guide GGUFs Benchmarks: unsloth.ai/docs/models/qwen3…
In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x. We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial.
Use `--spec-type mtp --spec-draft-n-max 2`
Thanks to Aman for github.com/ggml-org/llama.cp…!

117

785

123,742
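The advice to stop at 2 draft tokens can be checked with a rough calculation. This is a sketch under a simplifying assumption that is not stated in the tweet: each drafted token is accepted independently with the same probability, so a verification step yields 1 + a + a² + … + aⁿ tokens in expectation for acceptance rate a and n draft tokens. The function name is hypothetical, and the tweet's acceptance rates may be measured differently.

```python
def expected_tokens_per_step(acceptance_rate, draft_tokens):
    """Expected tokens emitted per target-model verification step.

    Assumes every drafted token is accepted independently with the same
    probability. The step always yields one token from the target model,
    which is the i = 0 term of the sum.
    """
    return sum(acceptance_rate ** i for i in range(draft_tokens + 1))


two_drafts = expected_tokens_per_step(0.83, 2)
four_drafts = expected_tokens_per_step(0.50, 4)
print(f"2 draft tokens at 83% acceptance: {two_drafts:.2f} tokens per step")
print(f"4 draft tokens at 50% acceptance: {four_drafts:.2f} tokens per step")
```

Under this model the 4-token setting produces fewer tokens per step than the 2-token setting while also paying for two extra draft passes, which is consistent with the recommendation. The model ignores draft-pass cost and any correlation between acceptances, so it is only a rough guide.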

Steffen Röcker retweeted

Tom Turney

@no_stp_on_snek

May 11

appreciate the comprehensive write-up from @_EldarKurtic, @mgoin_, @RedHat_AI on TurboQuant. data on H100 with native FP8 Tensor Cores looks right for what was tested. few things to add from the non-H100 side, where most of my testing lives:

Eldar Kurtić

@_EldarKurtic

May 11

TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:

1,984

Steffen Röcker retweeted

Eldar Kurtić

@_EldarKurtic

May 11

For more details and results check the full blog at vllm.ai/blog/turboquant . This is joint work with @mgoin_ and Alexandre Marques from @RedHat_AI and @vllm_project .

A First Comprehensive Study of TurboQuant: Accuracy and Performance

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from ver

vllm.ai

1,435

Steffen Röcker retweeted

Eldar Kurtić

@_EldarKurtic

May 11

322

80,548

Steffen Röcker retweeted

Armin Ronacher ⇌

@mitsuhiko

May 8

I think @antirez ds4.c is important! I wrote down my thoughts on why I built pi-ds4 and why we need to focus our local model efforts stronger than we do currently. lucumr.pocoo.org/2026/5/8/lo…

Pushing Local Models With Focus And Polish

Local models need focus and polish.

lucumr.pocoo.org

375

30,568

Steffen Röcker retweeted

tender

@tenderizzation

May 7

wow

4,277

123,292

Steffen Röcker retweeted

antirez

@antirez

May 7

Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other contributors. Thanks!

GitHub - antirez/ds4: DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm

DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm - antirez/ds4

github.com

218

1,493

197,431

Steffen Röcker retweeted

Yannick Nick

@keennay

May 7

>new AMD Instinct MI350P GPU
>CDNA 4
>PCIe Gen 5 x16
>144GB HBM3E 4TB/s
>native MXFP6 and MXFP4 support

AMD

@AMD

May 7

Don’t just scale AI. Scale ROI. AMD Instinct MI350P PCIe cards deliver 144 GB of HBM3E memory and up to 2299 teraFLOPS (at MXFP4) in a drop-in, air-cooled card built for standard servers. That’s how you scale AI at maximum ROI without redesigning your data center. Interested in drop-in AMD Instinct MI350P PCIe cards? See the specs at the link: bit.ly/4exiAg2

372

38,854