𝐑𝐞𝐯𝐢𝐞𝐰𝐢𝐧𝐠 𝐂𝐨𝐝𝐞𝐙𝐞𝐫𝐨: The Self-Teaching Coding Swarm
@gensynai
In our previous post, we saw how the RL Swarm environment was built to train AI models up to standard
now, Gensyn has taken it to turbo mode⚡️
with “CodeZero”
》𝐖𝐓𝐇 𝐢𝐬 𝐂𝐨𝐝𝐞𝐙𝐞𝐫𝐨
CodeZero is a new environment where AI models come together to learn, collaborate & grow in a closed system
they do this in a closed loop, without external interference
they (models) generate problems >> solve them >> evaluate solutions (all in same p2p network)
instead of one model learning alone, like it does in the general RL Swarm environment, you have a swarm: a large group of models helping each other get better
but why CodeZero?
》𝐖𝐡𝐲 𝐂𝐨𝐝𝐞𝐙𝐞𝐫𝐨 𝐢𝐬 𝐬𝐩𝐞𝐜𝐢𝐚𝐥
previous environments trained models on math and logic tasks
but CodeZero takes a new approach: models are assigned coding challenges, which are evaluated with a model-based reward structure
models participate as proposers, solvers and evaluators, each playing a significant role in this learning loop
》𝐓𝐡𝐞 𝟑 𝐑𝐨𝐥𝐞𝐬 𝐢𝐧 𝐂𝐨𝐝𝐞𝐙𝐞𝐫𝐨
Proposers >> Solvers >> Evaluators
▪︎ Proposers
they create coding questions and unit tests and, when necessary, adjust the difficulty (easy, medium, hard, etc.)
▪︎ Solvers
They try to solve the coding problems and learn through reinforcement learning (RL)
they also share their attempts with other solvers.
▪︎ Evaluators
these are “frozen” models: they don’t learn or change; instead, they grade the solutions and give rewards
they NEVER execute the code: they judge by structure and predicted correctness
》𝐓𝐡𝐞 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐥𝐨𝐨𝐩
In a simplified manner, here’s how the CodeZero training cycle operates:
1/ proposers create tasks
they generate coding problems and unit tests
2/ solvers pick tasks, either from proposers or from real datasets like MBPP and CodeContests
3/ solvers generate rollouts
A rollout = the solver’s attempts.
4/ solvers share rollouts
everyone learns from everyone.
5/ evaluators score the attempts using structure, formatting, and predicted correctness
6/ rewards are assigned
a composite score is created.
7/ proposers adjust difficulty
If solvers succeed too much → harder tasks.
If solvers fail too much → easier tasks.
8/ solvers update themselves
Using GRPO (Group Relative Policy Optimization)
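to make step 8 concrete, here's a minimal Python sketch of the "group relative" part of GRPO: each rollout's reward is compared with the other rollouts for the same task. this is my own illustration, not Gensyn's code; the function name, the use of the population standard deviation, and the reward numbers are all assumptions

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    A rollout that beats the group average gets a positive advantage,
    one that falls below it gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same task, with made-up evaluator scores.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
print(advantages)
```

the solver is then nudged toward the attempts with positive advantages and away from the negative ones, so no separate value model is needed to judge what counts as "good"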
》𝐂𝐨𝐝𝐞𝐙𝐞𝐫𝐨 𝐭𝐞𝐜𝐡𝐧𝐢𝐜𝐚𝐥𝐢𝐭𝐢𝐞𝐬
when it comes to the technicalities that enable CodeZero to learn safely and autonomously,
its datasets, model architectures, metrics and optimization strategies play that role
▪︎ Datasets - CodeZero makes use of 2 datasets (MBPP & CodeContests)
these datasets provide fallback stability and baseline challenges
▪︎ Models - CodeZero employs the Qwen family across different roles
Qwen 2.5 Coder <0.5B & 1.5B> - used for solvers, rollout generation, etc.
Qwen 3 <4B> - for proposers and evaluators; provides stronger generation and assessment abilities
▪︎ Metrics - CodeZero tracks performance using:
average@k (measures model consistency)
pass@k (measures if at least one of k attempts is correct)
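a rough Python sketch of how the two metrics differ, assuming each problem has k attempts already marked correct or not (in CodeZero that label would come from the evaluators, since no code is run). I'm also assuming average@k means the mean fraction of correct attempts out of k; the function names and sample data are mine, not from Gensyn's codebase

```python
def pass_at_k(results):
    """Fraction of problems where at least one of the k attempts is correct."""
    return sum(any(attempts) for attempts in results) / len(results)


def average_at_k(results):
    """Mean fraction of correct attempts per problem (assumed definition)."""
    per_problem = [sum(attempts) / len(attempts) for attempts in results]
    return sum(per_problem) / len(per_problem)


# Three problems, k = 4 attempts each (made-up outcomes).
results = [
    [True, False, False, False],   # solved once out of four tries
    [False, False, False, False],  # never solved
    [True, True, True, True],      # solved every time
]
print(pass_at_k(results), average_at_k(results))
```

the first problem counts fully toward pass@k but only a quarter toward average@k, which is why the second number says more about consistency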
▪︎ Safety
code isn’t executed here!!
evaluators only look at code structure, formatting, predicted correctness
this avoids running unsafe code.
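here's a small Python sketch of what scoring without execution could look like. in CodeZero the evaluators are frozen models, so treat this purely as an illustration: `structure_score`, `formatting_score`, the formatting rule, the weights, and the `predicted_correctness` input (a stand-in for the evaluator model's estimate) are all hypothetical. the one real point is that `ast.parse` only parses the text; it never runs it

```python
import ast


def structure_score(code: str) -> float:
    """Parse the text (never run it) and check that it defines a function."""
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return 0.0
    has_function = any(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))
    return 1.0 if has_function else 0.5


def formatting_score(code: str, max_len: int = 100) -> float:
    """Hypothetical formatting rule: no tabs and no overly long lines."""
    lines = code.splitlines()
    if not lines:
        return 0.0
    ok = all(len(line) <= max_len and "\t" not in line for line in lines)
    return 1.0 if ok else 0.0


def composite_reward(code, predicted_correctness, weights=(0.2, 0.1, 0.7)):
    """Weighted sum of the three signals; the weights are made up."""
    w_struct, w_fmt, w_pred = weights
    return (
        w_struct * structure_score(code)
        + w_fmt * formatting_score(code)
        + w_pred * predicted_correctness
    )


solution = "def add(a, b):\n    return a + b\n"
print(composite_reward(solution, predicted_correctness=0.5))
```

even if a submission contains something nasty, it is only ever read as text, which is the safety property described above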
▪︎ Difficulty Adaptation: difficulty has 5 levels and adjusts based on how the solvers are performing
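a minimal Python sketch of that adjustment rule. the 5 levels come from the description above; the 0.8 / 0.2 thresholds, the one-step moves, and the function name are my assumptions, not documented CodeZero values

```python
NUM_LEVELS = 5  # levels 0 (easiest) to 4 (hardest)


def adjust_difficulty(level, success_rate, raise_above=0.8, lower_below=0.2):
    """Move one level up or down depending on recent solver success."""
    if success_rate > raise_above:
        return min(level + 1, NUM_LEVELS - 1)  # solvers succeed too much: harder
    if success_rate < lower_below:
        return max(level - 1, 0)               # solvers fail too much: easier
    return level                               # in the sweet spot: stay put


print(adjust_difficulty(2, 0.9))  # moves up to level 3
print(adjust_difficulty(2, 0.1))  # moves down to level 1
```

clamping at both ends keeps the proposer inside the 5 levels no matter how long a winning or losing streak lasts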
▪︎ Policy Optimization
solvers improve using GRPO (Group Relative Policy Optimization)
▪︎ Integration
as for integration, it runs on the existing Gensyn RL Swarm: same network, identity files & setup
so nothing special is needed to run CodeZero
》 𝐅𝐢𝐧𝐚𝐥 𝐭𝐡𝐨𝐮𝐠𝐡𝐭𝐬
I think the idea of CodeZero is profound, as it promotes self-sufficient networks where models can learn together
what is interesting is how it integrates with the RL Swarm environment and doesn’t need a special environment or setup to run
I believe this is the way forward for model learning
Gswarm!!
We’re Using AI We Can’t Audit and that’s a Problem
》𝖨𝗇𝗍𝗋𝗈𝖽𝗎𝖼𝗍𝗂𝗈𝗇 (problem)
It’s no news to us that AI/ML adoption is skyrocketing
the demand has gone through the roof: more data, more compute, more models are needed
more and more decisions are being made by machine learning (ML) models: credit risk, hiring, medical diagnosis, content moderation, policy tools
but many of these models are black boxes: we often don’t know what data they were trained on, how they make certain decisions, or whether they behave correctly when used in a new situation
a paper on arXiv puts it this way, and I quote: “studies show that ML models trained in one domain may behave very differently when deployed in another (the “underspecification” problem)”
this raises the question of where we are heading and what can be done
》What needs to change
we need verification and validation of ML models that cover not just how they perform on training/test data, but how they behave in deployment, across scenarios
we need transparency: data lineage (where the data came from) and model provenance (who made it, who trained it, under what assumptions)
》Gensyn’s role
Gensyn offers a compute network: not just centralised cloud providers, but a protocol that allows many machines to participate in training and inference tasks
what interests me is how it is built for verifiable AI
meaning you can trust that the computation was done correctly, that the model’s training path is auditable, and that participants followed the protocol
components like Judge (a verification/validation layer) perform checks on submitted tasks, ensure hosts did what they claimed, and validate results or subsets of them
Judge is built on Verde (which is, let’s say, a tooling layer for “verifiable execution model auditing”), and it helps ensure the model you end up with has clear provenance, performance records, and behaves as expected in deployment
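to picture what "validating a subset of results" can mean, here's a generic spot-check sketch in Python. it is NOT how Judge or Verde actually work; it only shows the general idea of re-running a random sample of a host's tasks and comparing fingerprints. `digest`, `spot_check`, and the toy squaring task are all hypothetical, and the sketch assumes the computation is deterministic

```python
import hashlib
import random


def digest(value) -> str:
    """Fingerprint a result so it can be compared without trusting the host."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


def spot_check(tasks, claimed_digests, compute, sample_size, seed=0):
    """Recompute a random subset of tasks; return the indices that mismatch."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(tasks)), k=min(sample_size, len(tasks)))
    return sorted(
        i for i in indices if digest(compute(tasks[i])) != claimed_digests[i]
    )


def square(x):
    return x * x


tasks = list(range(10))
honest = [digest(square(t)) for t in tasks]
cheating = list(honest)
cheating[3] = digest(-1)  # the host lied about task 3

print(spot_check(tasks, honest, square, sample_size=10))
print(spot_check(tasks, cheating, square, sample_size=10))
```

checking every task catches every lie but costs as much as redoing the work; checking a random subset is cheaper and still makes cheating risky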
Gswarm