AI @ Red Hat

Joined February 2026
Sawyer Bowerman retweeted
New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with @RedHat and taught by @cedricclyburn.

Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far.

In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management.

Skills you'll gain:
- Quantize a model and measure the accuracy tradeoff
- Serve a model with vLLM and watch it handle concurrent requests efficiently
- Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy

Join and learn to serve LLMs efficiently: deeplearning.ai/courses/fast…
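For anyone who wants to check the memory numbers in the post above, here is a rough back-of-the-envelope sketch. It assumes the ~140 GB figure corresponds to 16-bit weights (2 bytes per parameter), which the post does not state explicitly. The function names and the layer/head configuration used for the KV cache are illustrative assumptions, not taken from the course.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Approximate KV cache size for one token of one request.

    The leading 2 accounts for storing both a key and a value vector
    per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Assumption: 16-bit weights, i.e. 2 bytes per parameter.
    print(weight_memory_gb(70e9, 2))    # 140.0
    # Assumption: 4-bit quantized weights, i.e. 0.5 bytes per parameter.
    print(weight_memory_gb(70e9, 0.5))  # 35.0

    # Hypothetical model shape: 80 layers, 8 KV heads, head dim 128, 16-bit cache.
    per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
    print(per_token)                    # 327680
    # One request holding 4096 tokens of context:
    print(f"{per_token * 4096 / 1e9:.2f} GB")  # 1.34 GB
```

These estimates cover weights and KV cache only; a real deployment also needs memory for activations and runtime overhead, so treat the numbers as lower bounds.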
Sawyer Bowerman retweeted
Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with @vllm_project on Day 0:
Sawyer Bowerman retweeted
This one has been in the works for a while. @cedricclyburn teaching LLM inference, compression, and benchmarking with @vllm_project -- free course with @DeepLearningAI. Proud of this one.
New short course: Fast & Efficient LLM Inference with vLLM, built in partnership with @RedHat and taught by @cedricclyburn. Learn to quantize an open-source LLM, serve it with vLLM, and benchmark your deployment across speed, cost, and accuracy. Free to enroll: hubs.la/Q04jXfpR0
Sawyer Bowerman retweeted
Just in: @NVIDIA is giving away a DGX Spark to a lucky meetup participant in London next week. See you there!
🇬🇧 London, June 10. @vllm_project & @_llm_d_ Inference Meetup hosted by Red Hat AI, @nvidia, and @SteliaAI at Sustainable Ventures, County Hall. On the agenda: vLLM project update, speculative decoding, llm-d in production, and AI safety evaluation. luma.com/iuecyow4
Them: "Can we do standup in Minecraft?" Us (me @cedricclyburn): *builds entire backend integration with OpenShift, Quarkus, and the GitHub GraphQL API* Players now run /git to change repos. Because why not? 🎮 Full presentation: youtu.be/b4LyNQAlrm4

Sawyer Bowerman retweeted
OpenClaw runs with your user-level permissions by default, meaning it inherits access to your GitHub token, Slack creds, filesystem, local network. On Red Hat OpenShift AI: container isolation, default deny networking, secrets management, and OpenTelemetry traces. Here's how:
Sawyer Bowerman retweeted
Your GPU is probably running at 30-40% utilization in production. The fix isn't a smaller model. It's 7 techniques that unlock 2-5x performance from your existing hardware. A 🧵:
Is your AI budget scaling faster than your results? Check out my new guide, which breaks down recent advances in optimizing AI performance into 7 digestible techniques! sprou.tt/1zPUg3gc3cu
As AI and LLMs become more and more powerful, the boundaries for efficient inference simply break down. If you want to experiment with quantized models as well, you can follow the code in the video, and check out huggingface.co/RedHatAI to pick any model you'd like!
What compression looks like on @vllm_project. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99% accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 @_soyr_ for the 2-minute demo.
Sawyer Bowerman retweeted
The Wildcard: Ever wanted to do your daily stand-up in @Minecraft? @_soyr_ shows us how (and why!) they did it. Don't miss these sessions, May 6-7! Register here: devoxx.co.uk