we're building von0.5B (idk why i picked this name but it sounds cool loll): a ~500M parameter small language model trained from scratch.
the goal is not just another LoRA adapter. this is a standalone small coding model pipeline: dataset staging, tokenizer training, scratch pretraining, SFT, ORPO preference optimization, and benchmark gating before any performance claims.
current progress:
- staged an 80k-row coding mixture on Kaggle (data is smallll, gathering more hopefully)
- mounted curated external coding datasets into a reusable training dataset
- validated a scratch pilot end-to-end
- launched the full von500m pretraining run on Kaggle 2x T4 using the staged mixture
the focus is high coding performance per parameter, with edge/phone usability as a secondary deployment target.
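for a rough feel of what "~500M parameters" means, here's a minimal sketch of the parameter arithmetic for a decoder-only transformer. everything in it is an assumption for illustration, not the actual von0.5B config: the `count_params` helper, the 32k vocab, the width/depth numbers, the gated 3-matrix MLP, no biases, and tied input/output embeddings are all hypothetical.

```python
def count_params(vocab_size, d_model, n_layers, d_ff, tie_embeddings=True):
    """Rough parameter count for a decoder-only transformer.

    Assumes (hypothetically) no bias terms, a gated MLP with three weight
    matrices, and two RMSNorm weight vectors per layer.
    """
    embed = vocab_size * d_model            # token embedding table
    attn = 4 * d_model * d_model            # q, k, v and output projections
    mlp = 3 * d_model * d_ff                # gate, up and down projections
    norms = 2 * d_model                     # two norm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = d_model                    # norm before the output head
    # With tied embeddings the output head reuses the embedding table.
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return embed + n_layers * per_layer + final_norm + lm_head


# Hypothetical config chosen only because it lands near 500M parameters.
total = count_params(vocab_size=32_000, d_model=1280, n_layers=24, d_ff=3456)
print(f"{total / 1e6:.1f}M parameters")
```

with these made-up numbers the embedding table is a bit under a tenth of the total, so at this scale the tokenizer's vocab size is a real part of the parameter budget and not a rounding error.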
no outputs yet, but i'm "building in public"
why? i needed to be able to run models on my phone, but the current ones keep outputting garbage because of heavy quantization.
so i decided to build one myself from scratch. definitely not an easy task, but a good one