After the last 24 hours, how can you read this and not see that decentralized training is going to be the most important innovation to come out of crypto since Bitcoin?
The 8B model currently training on Agora is 350B tokens in and continuing to converge. The top-level metrics and evals look almost exactly like a centralised run. But:
- 133 external contributors in total, bringing 4090s, 5090s, L40S/RTX 6000s and RTX 6000 Pros. These are cards that people actually own - there are no H100s, B200s etc.
- The maximum number of nodes the system can support (104) was reached almost immediately. The authorization layer is receiving approximately 100 requests/minute to join.
- Total tokens per second processed scales directly with the amount of compute in the swarm, with Agora constantly optimising to make the most efficient use of whatever hardware is present.
- MFU is approximately 20% and throughput is 170k tok/s. There are near-constant communication failures, which Agora is completely absorbing without slowdown.
- The system is effectively on autopilot, requiring very little intervention from us. Bad nodes are purged immediately, before training is affected, and new nodes take their place.
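
For a rough sense of what those throughput numbers imply, here is a back-of-envelope sketch. It is not Agora's own accounting: it assumes the common 6 x params x tokens approximation for dense-transformer training FLOPs (which ignores attention FLOPs), takes the 8B, 170k tok/s, 20% MFU and 104-node figures above at face value, and assumes all 104 node slots are filled. The function and constant names are made up for illustration.

```python
# Back-of-envelope estimate, NOT Agora's own accounting.
# Assumption: training cost ~= 6 * params * tokens FLOPs (forward + backward),
# the usual approximation for dense transformers; attention FLOPs are ignored.

def achieved_flops_per_sec(params: float, tokens_per_sec: float) -> float:
    """Useful model FLOP/s actually delivered by the swarm."""
    return 6.0 * params * tokens_per_sec

def implied_peak_flops_per_sec(achieved: float, mfu: float) -> float:
    """Aggregate hardware peak implied by a given MFU (achieved / peak)."""
    return achieved / mfu

# Figures quoted in the post (rounded; hypothetical constant names).
PARAMS = 8e9            # 8B-parameter model
TOKENS_PER_SEC = 170e3  # 170k tok/s
MFU = 0.20              # approximately 20%
NODES = 104             # maximum node count, assumed fully occupied

achieved = achieved_flops_per_sec(PARAMS, TOKENS_PER_SEC)
peak = implied_peak_flops_per_sec(achieved, MFU)
per_node_peak = peak / NODES  # a node may hold more than one GPU

print(f"achieved:      {achieved / 1e15:.2f} PFLOP/s")
print(f"implied peak:  {peak / 1e15:.1f} PFLOP/s")
print(f"per-node peak: {per_node_peak / 1e12:.0f} TFLOP/s (average)")
```

Under these assumptions the swarm is delivering roughly 8 PFLOP/s of useful compute out of roughly 41 PFLOP/s of aggregate peak. The per-node average is only indicative, since nodes differ in GPU type and may carry several cards each.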