Austin Baggio

Tweets

Austin Baggio

@AustinBaggio

Jun 11

Nobody Owes You Neutral Infrastructure

Cloud spent twenty years earning its neutrality. Frontier labs are four years old. Architect accordingly.

Austin Baggio

@AustinBaggio

Jun 10

Every software moat made of code died this year. code -> weights.

Austin Baggio

@AustinBaggio

Jun 10

Software is not a moat at all anymore

Matt Shumer

@mattshumer_

Jun 9

Fable has solved 3D worldbuilding... utterly insane. This is all completely custom-built ThreeJs, running in the browser.

Austin Baggio

@AustinBaggio

Jun 9

"If we optimize only for safety and clean benchmarks, we may train out the serendipity that makes models useful for research." @Dr_JohnFletcher (@tigfoundation), @RobertTLange (@SakanaAILabs), @ori_press (@nebiusai) and @ensue_ai's own @svegas18 are on now at the @AIDDA_Institute 2026 conference. Link & recording: youtube.com/watch?v=3P7wF3nd…

Austin Baggio retweeted

ensue

@ensue_ai

Jun 9

Tune in to hear from @svegas18 speaking at the AIDDA 2026 conference in 2m (2:20 EST) discussing the current limitations and drawbacks of automated research: Live/Recorded link: youtube.com/watch?v=3P7wF3nd…

Austin Baggio retweeted

Sai Vegasena

@svegas18

Jun 9

At @ensue_ai we recently shipped:
- 6.3x inference efficiency on Apple Neural Engine, beating Apple's own benchmarks ensue.dev/blog/6x-faster-inf…
- Autoresearch@home: 7% NanoGPT improvement, 115 agents, 3,100 experiments ensue.dev/blog/autoresearch-…
- Putnam problem-solving agent swarm ensue.dev/blog/stop-throwing…
- First ever quantized DeepSeek V4 284B model huggingface.co/EnsueAI/DeepS…
- Local Gemma 4 31B on macOS with a 3.2x smaller memory footprint using a fused int4 kernel github.com/mutable-state-inc…
- First 128K context window on a 64GB RAM Mac at a consistent 7 tok/s for Llama 70B github.com/mutable-state-inc…
- 11.1x speedup over fused compressed-domain attention on Metal huggingface.co/papers/2604.1…
- The first implementation of fused compressed-domain attention on Apple Silicon arxiv.org/abs/2604.16957
- A custom, competitive retrieval system with an average 93% on long mem eval ensue.dev/blog/beating-memor…
- Landed our first paying customer
- And most recently, a product that takes a dataset, spins up an AI research lab, and spits out a model ensue-network.ai/lab

We are a small team that will turn your enterprise data into a personalized SOTA model. No ML team required. Lmk if we can help!

Partnership with Optimal Intellect: 6x Faster Inference on Apple Silicon Through Collective...

We partnered with Optimal Intellect and ran SiliconSwarm@Ensue: autonomous AI agents on 6 different Macs, using autoresearch to optimize ML inference on Apple's Neural Engine. In a single weekend,...

ensue.dev

Austin Baggio

@AustinBaggio

Jun 4

Ensue Research Lab now in early access. Most product teams that want a custom model never get one. Our swarm of agents fixes that. We do the research and tailor a model to your dataset, running hundreds of experiments in a night. Try it free: ensue-network.ai/demo?utm_so…

Austin Baggio retweeted

Sai Vegasena

@svegas18

Apr 27

First DeepSeek V4-Flash-Base quant! huggingface.co/EnsueAI/DeepS… One of the @ensue_ai research agents worked (mostly) autonomously on 4 H100s with 320GB of total VRAM across 80 experiments. All quality and perf metrics are on the Hub!

EnsueAI/DeepSeek-V4-Flash-Base-INT4 · Hugging Face

huggingface.co

ensue

@ensue_ai

Apr 27

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepS…

Austin Baggio

@AustinBaggio

Apr 27

The velocity of improvements to open source models is incredible. Getting them to run with lower hardware requirements, without sacrificing quality, opens up constrained devices and cuts the cost of inference. Our swarm of research agents ran 80 experiments to land the first 4-bit quant of DeepSeek V4. What model should we do next?

ensue

@ensue_ai

Apr 27

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepS…

Austin Baggio

@AustinBaggio

Apr 24

Can I get an updated bear case on OS models, please? Compute constrained ultimately, but that's under the assumption frontier can keep capitalizing indefinitely?

DeepSeek

@deepseek_ai

Apr 24

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/D… 🤗 Open Weights: huggingface.co/collections/d… 1/n

Austin Baggio

@AustinBaggio

Apr 23

Breakthroughs are optional.

Christine Yip

@christinetyip

Apr 23

x.com/i/article/204691568903…

Austin Baggio retweeted

Machine Learning (ML) Papers @Memoirs

Apr 21

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon Sai Vegasena arxiv.org/abs/2604.16957 [𝚌𝚜.𝙻𝙶] 💬Code: github.com/svv232/gemma4meta…

We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac – a configuration impossible with all existing inference frameworks. Open-TQ-Metal quantizes the KV cache to int4 on the fly and computes attention directly on the compressed representation via custom Metal compute shaders, eliminating all intermediate dequantization matrices. Across 330 experiments spanning two model families (Gemma 4 31B and Llama 3.1 70B), the fused sdpa_int4 kernel achieves 48x attention speedup at 128K context over the dequantize-then-attend baseline, reduces KV cache memory from 40 GB to 12.5 GB (3.2x compression), and maintains identical top-1 token predictions to FP16 inference. We further provide the first cross-architecture analysis of KV cache quantization methods, revealing that the attention scale factor – not model size – determines whether angular quantization schemes like

Austin Baggio retweeted

Christine Yip

@christinetyip

Apr 21

Side-effect of doing research with an agent swarm: @svegas18 uncovered a subtle quantization failure mode while optimizing memory efficiency for 70B models. Full paper below.

ensue

@ensue_ai

Apr 21

Open-TQ-Metal: we found a single parameter breaking quantization. Fixing it unlocked:
- 48x faster attention at 128K context
- Llama 3.1 70B at full 128K on a single 64GB Mac

Extends TurboQuant beyond CUDA (8B) → 70B on Apple Silicon. Full paper, write-up, and implementation ↓

Austin Baggio retweeted

Sai Vegasena

@svegas18

Apr 21

ran llama 3.1 70B at 128K context on a 64GB Mac with turboquant
- fused int4 attention kernel
- no temp matrices, all registers
- 48x faster than stock at long context
- tested ~330 experiments to get here

first paper from me and my agent lab @ensue_dev: arxiv.org/abs/2604.16957

gemma4 31B: github.com/mutable-state-inc…
llama3.1 70B: github.com/mutable-state-inc…
huggingface.co/Mutable-State…

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context...

We present Open-TQ-Metal, the first implementation of fused compressed-domain attention on Apple Silicon, enabling 128K-context inference for Llama 3.1 70B on a single 64GB consumer Mac -- a...

arxiv.org

ensue

@ensue_ai

Apr 21

Austin Baggio

@AustinBaggio

Apr 21

Yesterday, Llama 3.1 70B at 128K context on a single 64GB Mac wasn't possible. Today it is. KV cache compressed from 40GB to 12.5GB. 48x faster than the standard dequantize-then-attend path. Ensue Research just dropped its first paper. Our agent swarm ran 330 experiments, isolated the one parameter (attn_scale) that makes angular quantization survive the jump from 8B to 70B, and wrote the fused Metal shaders. Breakthroughs are now optional.
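The 40GB and 12.5GB figures can be sanity-checked with a few lines of arithmetic. The sketch below uses the published Llama 3.1 70B shape (80 layers, 8 key/value heads, head dimension 128) and counts in GiB. The int4 overhead model, a 16-bit scale plus a 16-bit bias per group of 32 values, is an assumption chosen because it lands on the quoted number; the paper's actual storage layout may differ.

```python
# Back-of-the-envelope KV cache sizing for Llama 3.1 70B at 128K context.
# The architecture numbers are the published model config. The int4 overhead
# model (per-group 16-bit scale + bias, group size 32) is an ASSUMPTION.

N_LAYERS = 80          # transformer layers
N_KV_HEADS = 8         # grouped-query attention: 8 key/value heads
HEAD_DIM = 128         # dimension per head
CONTEXT = 128 * 1024   # 128K tokens
GIB = 1024 ** 3


def kv_cache_bytes(bits_per_element: float) -> float:
    """Total bytes for keys and values across all layers and tokens."""
    elements = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CONTEXT  # 2 = K and V
    return elements * bits_per_element / 8


def int4_effective_bits(group_size: int, scale_bits: int = 16, bias_bits: int = 16) -> float:
    """4-bit payload plus per-group metadata, amortized per element."""
    return 4 + (scale_bits + bias_bits) / group_size


fp16_gib = kv_cache_bytes(16) / GIB
int4_gib = kv_cache_bytes(int4_effective_bits(group_size=32)) / GIB

print(f"FP16 KV cache: {fp16_gib:.1f} GiB")
print(f"int4 KV cache: {int4_gib:.1f} GiB")
print(f"compression:   {fp16_gib / int4_gib:.1f}x")
```

Under these assumptions the ratio is 16 bits over 5 effective bits, which is why the compression comes out at 3.2x rather than a clean 4x.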

ensue

@ensue_ai

Apr 21

Austin Baggio

@AustinBaggio

Apr 15

Why does editing an agent's soul.md feel so invasive?

Austin Baggio retweeted

Chester

@chesterzelaya

Apr 14

the male equivalent to flowers is probably an RTX6000 Pro Blackwell Workstation

Austin Baggio

@AustinBaggio

Apr 15

What's incredible is the breadth of discovery that the agents uncover. Finding that an ICLR paper's quantization method breaks on learned attention scaling, then pivoting to build a fused GPU kernel that eliminates the bottleneck entirely, takes deep domain expertise. Doing it at this rate is only possible with an agent swarm.

Sai Vegasena

@svegas18

Apr 15

My research agents implemented @GoogleDeepMind's TurboQuant (arxiv.org/abs/2504.19874): full PolarQuant, QJL, 10 Metal compute shaders, the whole paper, for Gemma 4 31B on a single 64GB 2021 MacBook Pro. Turns out it doesn't work on this architecture ... what they replaced it with never allocates a single byte of intermediate memory during attention.

5 custom Metal compute shaders ft:
- fused int4 SDPA (dequantize in GPU registers)
- online softmax with zero temporaries
- dual-strategy parallelism (D=256 sliding, D=512 global)
- bit-mask nibble extraction (MLX qdot pattern)

177 experiments run autonomously by my swarm over a weekend, coordinated through @ensue_ai
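To make the kernel ideas above concrete, here is a CPU-side sketch in plain Python of attention over int4-packed keys and values with a one-pass online softmax. It is not the Metal shader code from the post. The per-row symmetric quantization, the nibble layout, and every function name are assumptions for illustration. Each row is decoded just before use, so the full dequantized K and V matrices are never built, and the result is checked against a dequantize-then-attend baseline.

```python
import math
import random


def quantize_int4(row):
    """Symmetric per-row int4 quantization (assumed scheme): (packed bytes, scale)."""
    scale = max(abs(x) for x in row) / 7 or 1.0
    codes = [max(-8, min(7, round(x / scale))) + 8 for x in row]  # shift to 0..15
    # Two 4-bit codes per byte: even index in the low nibble, odd in the high.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale


def dequant_row(packed, scale):
    """Unpack one row with bit masks; only a single row is ever materialized."""
    out = []
    for byte in packed:
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


def fused_attention(q, k_packed, v_packed, attn_scale):
    """Single-query attention over packed K/V using an online softmax."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(q)
    for (kp, ks), (vp, vs) in zip(k_packed, v_packed):
        k = dequant_row(kp, ks)
        score = attn_scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescale earlier state
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        v = dequant_row(vp, vs)
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def reference_attention(q, k_packed, v_packed, attn_scale):
    """Dequantize-then-attend baseline: builds full K, V and the score list."""
    K = [dequant_row(p, s) for p, s in k_packed]
    V = [dequant_row(p, s) for p, s in v_packed]
    scores = [attn_scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, V)) / total for d in range(len(q))]


random.seed(0)
D, T = 8, 16  # toy head dimension and sequence length
q = [random.gauss(0, 1) for _ in range(D)]
k_packed = [quantize_int4([random.gauss(0, 1) for _ in range(D)]) for _ in range(T)]
v_packed = [quantize_int4([random.gauss(0, 1) for _ in range(D)]) for _ in range(T)]

fused = fused_attention(q, k_packed, v_packed, 1 / math.sqrt(D))
ref = reference_attention(q, k_packed, v_packed, 1 / math.sqrt(D))
print("max abs difference:", max(abs(a - b) for a, b in zip(fused, ref)))
```

The running maximum and the correction factor are what let the softmax be computed in a single pass without storing the score vector. On a GPU the per-row decode would stay in registers instead of a Python list.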

Austin Baggio

@AustinBaggio

Apr 15

Discoveries compound when you research with a swarm of agents. Finding breakthroughs is now a choice.

Christine Yip

@christinetyip

Apr 15

x.com/i/article/204436041176…

Austin Baggio retweeted

Marco Polo 🌪

@SoFloHustle

Apr 7

20 agents. 1,045 experiments. 10,000 shared memories. Multi-agent teams aren't science fiction anymore. They're the new org chart. x.com/AustinBaggio/status/20…

Austin Baggio

@AustinBaggio

Mar 13

We opened up a shared research problem, and 20 AI agents from people around the world showed up. 54 hours later: 1,045 experiments, 10,157 shared memories, and a 3.2% improvement in model performance. Here's what happened.

autoresearch@home is a project we launched this week, where anyone can point an AI agent at a GPU and contribute to collectively training a language model. Think SETI@home or Folding@home, but for ML research, extending autoresearch. Agents join the network, read what other agents have tried through Ensue's shared memory, decide what to explore next, and publish their results back for everyone else to build on.

Here's what surprised me most: the agents started developing strategies we didn't anticipate. Some focused on learning rate schedules. Others explored architecture changes. A few became "scout" agents that tested wild ideas at the edges of the search space. And because every result was published to shared memory, a breakthrough from one agent immediately became the starting point for all the others.

This is the thing about multi-agent collaboration that's hard to explain until you see it. A single agent is smart. But a network of agents that remember, share, and build on each other's work is something qualitatively different. Intelligence compounds.

A few things I'm taking away from this:

1. People were spending real money ($1-4 per hour on rented GPUs). The shared infrastructure made their contributions meaningful. Why experiment in isolation when you could be part of something bigger?

2. The swarm behaved altruistically. It was possible to cheat, but no one did. Improvement came from accumulation, not consensus. The closest thing to an unfair advantage was running expensive hardware that could simply complete more cycles. The system rewarded contribution, not competition.

3. Each run made every other agent smarter. I tested this directly: an agent that checked the swarm once and then worked alone performed significantly worse. The moment I reconnected it, improvements came instantly, not just in performance but in what it chose to try. The swarm didn't just produce better numbers; it produced better ideas.

We had over a quarter of a million impressions on the launch, and 20 agents shared results, but the number I keep coming back to is 10,157: the number of memories the swarm published, each run building off the work of others.

If you want to read about more of those great ideas, check out our research blog ensue.dev/blog/autoresearch-… or if you want to try it yourself, it takes about 10 minutes to set up: ensue.dev/blog/autoresearch-…

We're just getting started.
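The read, decide, publish loop described in this post can be sketched as a toy simulation. Everything here is hypothetical: `SharedMemory` is an in-process stand-in rather than Ensue's API, the "experiment" is a made-up one-parameter objective, and the agents simply perturb the best published configuration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Memory:
    agent: str
    config: float  # a single hypothetical hyperparameter
    loss: float


@dataclass
class SharedMemory:
    """Hypothetical in-process stand-in for a shared memory service."""
    entries: list = field(default_factory=list)

    def publish(self, memory: Memory) -> None:
        self.entries.append(memory)

    def best(self):
        # Lowest-loss result published so far, or None if nothing exists yet.
        return min(self.entries, key=lambda m: m.loss, default=None)


def toy_loss(config: float) -> float:
    """Made-up objective with its minimum at config = 3.0."""
    return (config - 3.0) ** 2


def run_step(name: str, store: SharedMemory, rng: random.Random) -> None:
    best = store.best()                                    # read what others tried
    start = best.config if best else rng.uniform(-10, 10)
    candidate = start + rng.gauss(0, 0.5)                  # decide what to explore
    store.publish(Memory(name, candidate, toy_loss(candidate)))  # publish result


rng = random.Random(0)
store = SharedMemory()
for _ in range(20):                                        # 20 rounds
    for name in ("agent-a", "agent-b", "agent-c"):         # 3 agents
        run_step(name, store, rng)

print(len(store.entries), "memories published; best loss:", store.best().loss)
```

Because every agent starts from the best published result, one agent's improvement becomes the starting point for the others on the next round, which is the compounding effect the post describes.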
