Clockwork.io builds software that optimizes GPU clusters for fault tolerance, deterministic performance and increased utilization.

Joined April 2021
189 Photos and videos
Pinned Tweet
Today we’re launching TorchPass — Workload Fault Tolerance. GPU failures, network disruptions, driver bugs — every fault forces a full job restart. Hours of compute, gone. TorchPass makes faults invisible to the workload. Training continues. No restarts. No lost progress.
3
1
4
173
clockworkio retweeted
The most important conversation you'll have this year is with someone who isn't answering your cold email. Not because they’re ignoring you. Because they're heads-down building. The one place where those conversations happen naturally is in person. Proximity is the ultimate currency in AI. RAISE Summit 2026 brings together the companies building the foundation, the frontier, and everything in between. From hyperscalers to dev tooling, from enterprise to financial, here's who's in the room. Headlining it all: @IREN_Ltd and @cognition The diamond tier: @Snyksec, @Nscale, @togethercompute, @Cursor_ai The infrastructure powering the frontier: @CoreWeave, @Nebiusai, @Vultr, @WEKAIO, @SambaNovaAI, @Baseten, @CrusoeAI, @DDNintelligence and more. The builders' toolkit: @ElevenLabs , @JetBrains, @GetSentry, @Modal, @Glean, @FireworksAI_HQ, and the rest of the AI-native stack. The enterprise financial backbone: @BankofAmerica, @Stripe, @MongoDB, @Datadoghq, @Alibaba_cloud, @Lenovo, @NetApp and more. 9,000 attendees. 2,000 companies. 350 speakers. 80% C-level and founders. July 8-9 2026 · Carrousel du Louvre, Paris Last chance: Save 25% before MIDNIGHT 🎟️ Secure your ticket: hubs.li/Q04k3TY20
3
33
11,744
The AI infrastructure utilization gap isn't a compute problem. It's a networking problem. And the fabric decisions that close it are being made right now. Roy Chua, Suresh Vasudevan, and Balaji Prabhakar broke down what’s changing in AI infrastructure economics. 🧵
1
2
4
101
Where the panel landed: Extreme co-design is the pragmatic answer today. But software that can optimize for workload-aware network fabrics wins long term. The teams that understand where that transition happens will make better fabric decisions now.
1
5
NVLink, UALink, or ESUN — do you have a scale-up strategy, or are you buying what ships first? The scale-up interconnect market is no longer NVIDIA's alone. UALink is gaining traction. ESUN is in the mix. Multi-vendor competition at the node level is real for the first time, and the procurement decision you make now is an architectural commitment that compounds across every training run, every inference workload, every dollar of GPU spend for the next several years.
1
11
The choice isn't just about bandwidth. It's about what you're locking into: the congestion control model, the failure domain boundaries, the observability surface, the upgrade path when the next silicon generation ships. Co-designed platforms get you integration. Open alternatives get you optionality. Neither is free. Most teams don't have a scale-up strategy. They have a vendor relationship and a ship date. At the capital intensity of today's AI infrastructure, that's an expensive way to make an architectural decision. 🔗 Live panel: NVLink, UALink, ESUN, and the fabric decisions that compound at scale linkedin.com/events/74659383…
1
1
10
400 interruptions in 54 days. That was Meta's Llama 3 training run. What's your number — and do you even know? Meta published theirs. Most teams don't track it at that resolution, which means they're absorbing the same cost without the visibility to act on it. Every interruption is a rollback, a checkpoint reload, a synchronization restart — GPU-hours consumed by recovery, not by model-advancing work. At 16,000 GPUs, the math compounds fast. A cluster that fails every few hours doesn't just lose the time to recover. It loses the training progress since the last checkpoint, plus the overhead of getting all workers back to a synchronized state, plus the ripple effects on everything scheduled after it. Meta had the engineering depth to instrument, publish, and learn from that number. Most organizations running serious training workloads are flying with less visibility than that — which means the cost is there, it's just not visible yet. Knowing your interruption rate is step one. Having a fabric architecture that reduces it — and recovers faster when it happens — is the actual lever. 🔗 Live panel on the networking decisions behind AI cluster reliability: na2.hubs.ly/H05FrYQ0 #AIInfrastructure #GPUClusters #DistributedTraining #AIFabric #DataCenterNetworking
14
Your fabric is three networks pretending to be one. Are you designing for that — or hoping it holds? Scale-up connects GPUs within a node. Scale-out connects nodes within a pod. Scale-across carries traffic between buildings. Each has different physics, different failure modes, and a different blast radius when something goes wrong. Most clusters treat them as a unified fabric. They aren't. The assumptions that hold at 256 GPUs quietly break at 4,096 — not because the hardware changed, but because the interactions between the three layers compound at scale. A congestion event in the scale-out fabric shows up as a training stall that looks like a compute problem. A thermal event in the scale-up layer propagates in ways the scale-across layer can't see. The failure modes multiply across the seams. The teams getting this right are asking different questions before the cluster is built: What's our congestion control strategy at each layer? Where does the blast radius of a failure stop? How do we instrument across all three without three separate forensics teams? NVLink, UALink, or ESUN isn't a procurement question. It's an architectural commitment with consequences that compound. 🔗Live panel on the fabric decisions reshaping AI infrastructure economics: na2.hubs.ly/H05FrYQ0
12
N is for NCCL, which timed out at dawn. F is for Flapping, a whispering dread. Q is for Queue pair, precise and exact — one stalls on the fabric and all lose the track. Edward Gorey wrote an alphabet of inevitable disasters. Clockwork.io’s Solutions Engineer Lerna Ekmekcioglu thinks he'd have been right at home writing about AI training at scale. 🧵
1
1
2
31
What that actually costs: → Slow queue pairs degrade throughput by 50% → A NIC fails, the job dies — reviving the NIC doesn't revive the job → Work lost is entirely determined by your last checkpoint
1
21
𝟗𝟐% 𝐨𝐟 𝐭𝐡𝐞 𝐜𝐨𝐬𝐭 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐛𝐞𝐭𝐰𝐞𝐞𝐧 𝐆𝐏𝐔 𝐜𝐥𝐨𝐮𝐝 𝐩𝐫𝐨𝐯𝐢𝐝𝐞𝐫𝐬 𝐡𝐚𝐬 𝐧𝐨𝐭𝐡𝐢𝐧𝐠 𝐭𝐨 𝐝𝐨 𝐰𝐢𝐭𝐡 𝐆𝐏𝐔 𝐩𝐫𝐢𝐜𝐢𝐧𝐠. It’s goodput: how much of your cluster is doing useful work vs recovering from failures. SemiAnalysis modeled a 5,184 GB300 NVL72 cluster: • TorchPass: 6% goodput expense • Checkpointless: 10.53% • Checkpoint restart: 20.91% At Llama 3 scale, checkpoint restart can leave ~80% of GPUs idle after repeated failures. As @SemiAnalysis_ put it, TorchPass is “the closest we've seen to what Frontier Labs are using at this scale.” Infrastructure efficiency is becoming the real AI scaling advantage. #AIInfrastructure #GPUClusters #Goodput #FaultTolerance #semianalysislinuxwebinar
21
At 4,096 GPUs, your training job's MTBF is hours, not days. Every interruption: (t_id t_failover)·j_size t_repair·b_radius, times $/GPU-hr. At $4/GPU-hr, an hour of idle 4k GPUs is ~$16k. Per incident.🧵
1
1
1
32
Why it works at scale: soft failures dominate above 1k GPUs. ECCs accumulating, GPU off the bus, NVLink errors, link flaps, slow-degrading nodes. Scheduler catches the signal before NCCL stalls. You migrate while the node can still hand off state cleanly.
1
35