How can we debug the operating system in our own brains? This is the most important question facing humanity.

Joined November 2023
112 Photos and videos
BrainOS retweeted
the one box i was missing just landed anon. this is the @FrameworkPuter desktop with amd's strix halo, ryzen ai max 395, 128gb of unified memory, up to 96 of it addressable as vram. amd and framework sent it over for honest testing, no strings attached, and i've been waiting on this one specifically. here's why it matters. i've run local ai on basically everything, a 150 dollar drawer card, a 3090, a 5090, the dgx spark, datacenter h200s. the one gap was always the accessible big memory tier on the amd side, and this fills it. 128gb unified at roughly half the price of the nvidia equivalent, the sovereignty box for people who want to run real models without a datacenter budget. booting it today. and the question i actually want answered is the one nobody answers straight: what does this thing really run? same bar i hold every other card to. amd, nvidia, apple, measured, never vibes. let's find out what it's got.
listen up ROCm and Vulkan builders. @FrameworkPuter just shipped me strix halo desktop, 128GB unified, landing on my desk tuesday. everyone keeps asking what actually runs on this thing beyond vendor charts and forum guesses. so i'm going to answer it properly. starting with big MoE models since massive total params on light active is the whole point of 128GB unified. if there's a specific model or quant you want tested on strix halo, reply and it goes in the queue.
9
7
77
11,051
Just found out Gemma4 26B has some amazing video capabilities. Can you wait to try out manim. 1. Video Analysis (Visual & Audio) * Vision Analysis: If you provide a link to a video or a screenshot from one, I can use my vision capabilities to describe what is happening, identify objects, read text on screen, or analyze the composition. * YouTube Summarization: I can access YouTube transcripts to summarize videos, extract key points, or answer specific questions about the spoken content. 2. Video Generation & Animation * Manim Animations: I can write Python code using the Manim engine to create high-quality, mathematical, or algorithmic animations (similar to the style of 3Blue1Brown). * ComfyUI: I can interface with ComfyUI workflows to generate or manipulate video content if you have a generative setup configured. * ASCII Video: I can convert existing video or audio files into colored ASCII art formats (MP4/GIF). 3. Video Editing & Processing * Technical Processing: Using the terminal and Python, I can perform programmatic edits like trimming, resizing, converting formats, or extracting frames/audio using tools like ffmpeg.
1
61
BrainOS retweeted
.@nvidia gave us all the hardware we need to make local AI awesome. 16 x DGX Spark 3 x RTX Spark 1 x DGX Station 24 x ConnectX-7 Cables 2 x High Speed Switches local dot AI
How it started // how it's going @exolabs @NVIDIAAI
143
116
1,662
265,836
BrainOS retweeted
NOVEMBER 30, 2010: Four weeks after losing the election as AG of CA, Kamala Harris wins due to mail-in surge from LA Harris won by less than 1%
850
13,720
57,024
1,926,254
BrainOS retweeted
Replying to @uzairansar
PSA: This is not an official app by us - it would be nice uzi if we made this more clear in the name or something
23
12
721
16,113
This is a much watch for anyone considering local inference. Excellent presentation. Great job.
May 29
I yapped about LLM Compression for 40 minutes, how much misinformation did i spread this time (,:
2
55
BrainOS retweeted
May 27
Deepseek-v4-flash REAP is done There's an 80 and 96gb version (weights) I am working through the pain of getting the model to run on sm121 (DGX Spark) If anyone of you has a working DS4 vllm/sglang config/env/docker for Sparks please share Once it works I'll make the HF public
29
13
287
12,420
Running NVIDIA Nemotron-3-Super-120B-A12B (NVFP4 mixed) on DGX Spark / GB10 (Grace-Blackwell) via vLLM 0.19.2rc1. Architecture resolves as NemotronHMTPModel — hybrid Mamba Transformer with MTP draft head. PROBLEM: prefix caching is non-functional on this model. With --enable-prefix-caching explicit, vLLM emits this warning on boot: "Prefix caching in Mamba cache 'all' mode is currently enabled. Its support for Mamba layers is experimental." Measured behavior over a real production run: queries: 6,211 hits: 0 hit rate: 0.00% Confirmed across thousands of requests with shared system prompts that should be deduplicating cleanly. They aren't. This isn't a misconfiguration — NVIDIA's own Nemotron-3-Super deployment guide (<github.com/NVIDIA-NeMo/Nemot…> SparkDeploymentGuide) deliberately OMITS --enable-prefix-caching. So upstream knows it doesn't work on NemotronH. WHY IT MATTERS: We run a 4-way parallel fan-out pattern — brain decomposes a brief, dispatches 4 concurrent section-writes against vLLM with an identical 163-token shared system prompt, stitches results. Measured throughput: c=1: 15.0 tok/s c=2: 25.4 tok/s c=3: 30.1 tok/s c=4: 41.4 tok/s (← saturation; 4th slot effectively free) That 2.76× user-facing speedup is real. But every one of those 4 calls re-prefills the same 163-token system message independently. With a working prefix cache, that prefill cost drops ~4×. For larger shared prompts (RAG context, long instruction blocks, few-shot exemplars), the win compounds enormously. THE STRUCTURAL QUESTION: Is this a fundamental limit? Mamba's selective- scan state can't be paged like transformer KV — it's a fixed-size recurrent state, not a sequence of key/value vectors. So for the Mamba LAYERS of a hybrid model, prefix cache semantics are genuinely unclear. But for the TRANSFORMER LAYERS interleaved with them, the KV reuse should work fine. Has anyone at @NVIDIAAIDev or @NVIDIAAI considered a "hybrid prefix cache" mode that caches the transformer-layer KV pages for a matched prefix while re-running the Mamba state forward? Even a partial fix would eliminate most of the prefill cost on this architecture. Or — is there a known issue / planned vLLM PR I should be watching? Happy to share the bench harness if it's useful for repro. cc @vllm_project — same question your side. Hardware: DGX Spark, GB10, 121 GiB unified memory, --max-num-seqs 4, --quantization fp4, MARLIN MoE backend, async scheduling, no MTP (breaks structured tool-call emission — separate issue).

1
2
137
I’d like to get some feedback on running multiple models on a DGX. Anyone else run into this?
41
I have become heavily addicted to local AI. It’s consuming me. Thankfully I have broken several other addictions. So I know I’ll eventually break this one. I’ve learned it’s the attachment to the feeling that the addiction gives you. Local AI excites me and I think it’s a miracle. Using it and thinking about the future makes me feel alive like never before. For those who have never promoted a jailbroken 35b local AI model, you wouldn’t understand.
81
BrainOS retweeted
With Grok, you can literally plug xAI's entire stack directly into Hermes Agent and OpenClaw and actually run it powerfully It comes with a full, powerful suite that makes every other AI plan look obsolete Grok-powered setup gives you: • xAI's most powerful models • Speech-to-text (STT) • Text-to-speech (TTS) • Grok Imagine for images • Grok Imagine for videos • Native 𝕏 Search built-in • Native web search • Massive context window The best part is that you can literally use these capabilities directly with your existing 𝕏 Premium and SuperGrok subscriptions....not with expensive API bills There is literally no better deal in AI right now
70
112
777
25,787
BrainOS retweeted
May 20
got NVIDIA’s new Nemotron-Labs-Diffusion 8B running locally on my DGX Spark. Jetson(hermes) made me the tri mode runner the cool part: it’s one model that can answer in different “gears.” same prompt, same checkpoint: - normal mode: 10.98 tok/s - diffusion mode: 20.58 tok/s - self-spec mode: 18.01 tok/s - self-spec lora: 18.07 tok/s plain english: diffusion mode was almost 2x faster than normal mode in this tiny first test. but faster wasn’t automatically better. the fastest mode also started repeating itself, so now the real test is running a bigger prompt suite and checking both: - how fast it answers - whether the answer is actually good early result: the tri-mode idea works locally. next step is figuring out which mode is best for which kind of prompt.
4
4
39
5,555
BrainOS retweeted
Big news for DGX Spark users! NVFP4 optimized for your GB10 is fast. Until now, many of us have been using a working but suboptimal Marlin backend instead of CUTLASS. We benchmarked five models, all using CUTLASS backends. All ran stably, all were faster 😁
8
8
113
20,556