TENSORHUB: SCALABLE AND ELASTIC WEIGHT TRANSFER FOR LLM RL TRAINING
This paper is about fixing the storage vs transfer tradeoff in RL weight sync. NCCL gives you throughput, but it struggles with dynamic membership. UCX gives you flexibility, but fan-out turns the sender into the bottleneck. Storage gives you decoupling, but push-then-pull is the wrong shape once weights are huge. TensorHub’s move is to stop treating storage as something that has to own bytes. Instead of copying weights into a storage layer, it treats already-replicated live weights on GPUs as the storage substrate and serves reads from those replicas directly.
Core
Use live replicated weights as the storage layer:
- no explicit ownership of weight data
- no extra stored copies in the common case
- direct GPU-to-GPU reads
- storage-style decoupling between trainers and rollouts
- centralized scheduling with decentralized transfer
What it is
The core abstraction is Reference-Oriented Storage (ROS). ROS makes a weight version look stored and fetchable, but it never materializes an owned copy unless it has to. It just tracks which workers already hold that immutable version and routes reads to them. TensorHub is the production system on top of that idea, adding topology-aware scheduling, consistency across model-parallel shards, and failure handling.
Why this is different
The difference is in the ownership model, not just the transfer speed.
- NCCL gets bandwidth, but static groups and global coordination make it fragile under churn
- UCX avoids collectives, but sender-side uplinks collapse under fan-out
- storage systems decouple cleanly, but they pay for that abstraction with extra movement and extra memory
TensorHub is trying to keep the storage interface without paying the normal storage tax.
Key mechanisms
To make that work, they need a few systems pieces:
- a mutability contract so a worker can reuse buffers without corrupting concurrent reads
- a retention protocol so required versions do not disappear when the last live replica is about to go away
- topology-aware source selection at the server
- pipelined replication so partially updated replicas can immediately start serving downstream reads
- transactional consistency for model-parallel groups
- failure recovery across replica churn, spot loss, and server failover
Results
The numbers are strong and pretty practical:
- up to 6.7× lower total GPU stall time for standalone rollouts vs NCCL
- 4.8× faster weight update for elastic rollout vs UCX
- 19× lower GPU stall time for cross-datacenter rollout
- 50 GB shard transfer in 2.2s at 22 GB/s, about 88% of theoretical RDMA bandwidth
- Ray object store takes about 32s for a 40 GB transfer where GPU-direct RDMA takes about 0.2s
The important systems point is that they recover most of the decoupling benefits people want from storage, but without turning every weight update into push to storage then pull from storage. Once weights are already replicated for inference, copying them again into an ownership-based storage path is mostly self-inflicted overhead.