Gemma 4 E2B, compressed by @TheStageAI from 9.3 GB to 1.4 GB, is running on iPhone 16e with tool calls!
The smallest and the best quality checkpoints open-sourced! @GoogleDeepMind
The smallest checkpoints for Gemma 4 E2B and E4B for local inference. Results for E2B:
size: 9.3 GB → 1.4 GB
speed: 113 tok/s on Apple M3
quality: -3% on IFEval
runs with: MLX, llama.cpp (coming)
Pareto-optimal, open source! Links to the blog post and GitHub repo ⬇️
@GoogleDeepMind @lmstudio @ollama @huggingface @ggerganov
Beyoncé heard cursing. TheWhisper heard Arsenal.
The fastest Whisper in the world.
Open-source real-time ASR.
Top 5 on OpenASR benchmarks.
1800 RTFx.
Built for live captions, transcription, and voice apps.
See the repo
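For a sense of scale, here is a minimal sketch of what an RTFx figure implies, assuming the usual definition of RTFx as audio duration divided by processing time. The helper name `processing_seconds` is hypothetical and not part of TheWhisper; the real number depends on hardware and batch setup.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Time needed to transcribe `audio_seconds` of audio at a given RTFx.

    Assumes RTFx = audio duration / processing time (inverse real-time factor).
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One hour of audio at the quoted 1800 RTFx.
one_hour = 3600.0
print(processing_seconds(one_hour, 1800.0))  # 2.0
```

Under that definition, one hour of audio would take about two seconds of compute.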
For AI engineers, latency is product.
Wan 2.2 in Elastic Models now generates 5s of video in 34s on H100. Elastic Models is a library of accelerated open-source models.
Also new: TheWhisper at 1800 RTFx on a single H100 and instant FLUX LoRA switching.
Try it
How do you make text-to-music run in real time in production?
The model has to keep audio generation ahead of playback.
Our new case study with @MireloAI shows how inference optimization delivered up to 2.4× higher throughput.
See the full case study ↓
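The "generation ahead of playback" condition can be sketched as a simple timing check. This is an illustration only, not the method from the case study: the function name `stays_ahead`, the chunked generation model, and all numbers below are hypothetical assumptions.

```python
def stays_ahead(chunk_seconds: float,
                gen_seconds_per_chunk: float,
                n_chunks: int,
                prebuffer_chunks: int = 1) -> bool:
    """Return True if playback never runs out of generated audio.

    Assumes audio is generated in fixed-length chunks, one after another,
    and playback starts once `prebuffer_chunks` chunks are ready.
    """
    playback_start = prebuffer_chunks * gen_seconds_per_chunk
    for k in range(1, n_chunks + 1):
        ready_at = k * gen_seconds_per_chunk                   # chunk k finishes generating
        needed_at = playback_start + (k - 1) * chunk_seconds   # playback reaches chunk k
        if ready_at > needed_at:
            return False  # buffer underrun: the listener hears a gap
    return True


# Hypothetical numbers: 2 s chunks of audio.
print(stays_ahead(2.0, 1.0, n_chunks=100))  # True: each chunk takes 1 s to generate
print(stays_ahead(2.0, 2.5, n_chunks=100))  # False: each chunk takes 2.5 s to generate
```

With a one-chunk prebuffer, the check passes exactly when a chunk takes no longer to generate than it takes to play.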
Proud to team up with @brilliantlabsAR and @neuphonicspeech on Halo’s on-device privacy engine.
Coming to Brilliant Labs’ Halo smart glasses: real-time voice and vision, while your POV stays private.
ANNA GPU/NPU SDK memory manager for wake word, STT, TTS, diarization.
SDK demo 👇
Are you a big fan of jacket potato?
This is an open-source, real-time multilingual ASR for live speech.
It stays robust in heavy noise – even at SNR 0 dB.
That’s why it understands speech where people struggle to hear.
Use it for transcription, research, and multilingual apps.
At TheStage AI, we shipped @nvidia cuDNN Paged Attention in our Elastic Models library.
We replaced paged FlashAttention for better integration. In our benchmarks, the cuDNN path shows nearly identical quality and latency vs the previous implementation.
Early results on B200: INT8 Llama 8B ~200 tok/s per sequence @ bs16 (≈ 3,200 tok/s aggregate).
The write-up also covers CUDA Graphs, graph caching, cuDNN Paged Attention, and INT8 LLMs. Next, we are moving to native inference support across NVIDIA hardware, including Jetson.
Check the blog for details:
app.thestage.ai/blog/Integra…
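The aggregate B200 figure above follows from the per-sequence rate and the batch size. A minimal sketch of that arithmetic, assuming every sequence in the batch decodes at the same rate (the function name is hypothetical):

```python
def aggregate_tokens_per_second(per_sequence_tps: float, batch_size: int) -> float:
    """Total decode throughput when each of `batch_size` sequences
    runs at `per_sequence_tps` tokens per second."""
    return per_sequence_tps * batch_size


# Quoted figures: ~200 tok/s per sequence at batch size 16.
print(aggregate_tokens_per_second(200, 16))  # 3200
```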
New SOTA TheWhisper checkpoint.
Update is out.
Open-source multilingual STT built for real-time streaming and noisy audio.
6.0% WER on Open ASR, ahead of Parakeet and Whisper.
Optimized with our stack – ANNA, Automated Neural Networks Accelerator.
Code is open. GitHub →
Significant speed and size gains in model inference are possible without hurting output quality.
ANNA is our PyTorch framework for automated model acceleration, a new way to think about MLOps.
Smaller ckpts, lower cost, faster inference, no retrain.
Try the demo or request access
We’ve made it easy to run text-to-image models on @Modal with the speed you’d expect from top inference providers.
Follow our quick guide to deploy containers with an @OpenAI-compatible API and get 2× faster performance.
Big thanks to @MireloAI for the soundtrack magic 🎶
Great communities make great products.
At @TheStageAI, we’re building ANNA, our Autonomous Neural Networks Accelerator, for faster, cheaper inference.
We need a Community Manager now. Be part of the early story →
TheStage AI is now SOC 2 Type I compliant.
We did it to keep models, data, and IP secure. Clients get confidence, simpler procurement, and compliant AI deployment.
This milestone sets us up to grow into enterprise, government, and regulated markets.