Hello everyone 👋
I’m building DreamOS, a minimal Linux-based operating system focused entirely on local AI inference. It boots directly into a dashboard for chatting, model management, benchmarks and live hardware metrics.
The core control plane is written in Rust. DreamOS uses the official NVIDIA Linux/CUDA stack, while owning the layers above it: hardware-aware configuration, model residency, KV-cache management, inference scheduling, context handling, telemetry and automatic backend tuning.
The current development target is Qwen3.5-9B Q4_K_M with a 32K context on an RTX 4060. Early runtime experiments have peaked at roughly 45 tokens/s, but I’m still building matched, reproducible benchmarks before claiming any definitive improvement over stock llama.cpp.
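To make "matched, reproducible" a bit more concrete, here is a minimal sketch of how repeated runs could be summarised so that a single lucky peak is not mistaken for typical throughput. The `Run` struct, the helper names and the timings are all hypothetical placeholders for illustration; they are not DreamOS code and not measured results.

```rust
use std::time::Duration;

/// One timed generation run (hypothetical record, not a DreamOS type).
struct Run {
    tokens: u32,       // tokens generated in this run
    elapsed: Duration, // wall-clock time spent generating them
}

/// Throughput of a single run in tokens per second.
fn tokens_per_sec(run: &Run) -> f64 {
    run.tokens as f64 / run.elapsed.as_secs_f64()
}

/// Median of a non-empty slice; less sensitive to outliers than the peak.
fn median(values: &[f64]) -> f64 {
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    }
}

fn main() {
    // Placeholder timings, made up for the example.
    let runs = [
        Run { tokens: 256, elapsed: Duration::from_millis(6400) },
        Run { tokens: 256, elapsed: Duration::from_millis(5120) },
        Run { tokens: 256, elapsed: Duration::from_millis(8000) },
    ];

    let rates: Vec<f64> = runs.iter().map(tokens_per_sec).collect();
    let best = rates.iter().cloned().fold(f64::MIN, f64::max);

    println!("median {:.1} tok/s, best {:.1} tok/s", median(&rates), best);
}
```

Reporting the median next to the best run, with the same prompt, context length and token count for both backends, keeps an "up to" figure from standing in for the whole distribution.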
I’m now profiling the exact CUDA decode bottlenecks and experimenting with custom Q4_K kernels, KV-cache compression, speculative decoding and persistent execution. There is also a longer-term bare-metal DreamOS research track, but the practical product comes first: making local models run faster, stay loaded and use less memory, so that the machine feels like a complete AI-native system rather than another application running on a general-purpose desktop OS.