Last week my custom T5-style model was collapsing: gibberish outputs and endless CUDA crashes.
Instead of starting over, I spent days dissecting Google’s Gemma codebase.
What I found changed everything. Thread:
Gemma-27B is not just another model; it’s a masterclass in disciplined engineering. Its custom T5-derived design, RoPE implementation, and memory discipline are surgical.
I kept my original multi-billion-parameter vision but deliberately incorporated every architectural insight that fit a clean, from-scratch codebase.
Outcome → complete ground-up rewrite. Zero copy-paste. Every line remains mine:
- attention.py – custom attention block with shared QKV projection, RoPE, and RMSNorm inspired by Gemma’s patterns
- rope_utils.py – positional embeddings derived directly from the original math
- quant_utils.py – 4-bit & 8-bit kernels written from first principles and whitepapers
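To make the "original math" behind rotary embeddings concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). It is not the code from rope_utils.py. The function names, the interleaved-pair layout, and the base of 10000 are assumptions for illustration, and real implementations often use a split-half layout and batched tensors instead.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2D rotation of the (x, y) pair.
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors for one attention head of size 4.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3 tokens) at two very different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

The two scores agree up to floating-point error, which is the property that makes RoPE attractive: the attention logit depends only on the relative distance between query and key, not on where the pair sits in the sequence.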
Training and file structure now follow DeepSeek’s ESFT philosophy:
- Strict separation: config → dataset → model → train/inference API
- train.py mirrors DeepSeek’s trainer pattern (hooks, resume logic, clean eval loop)
- Entire layout designed for future distributed scaling without refactoring
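As a rough illustration of what "hooks" and "resume logic" mean in a trainer, here is a toy sketch in plain Python. It is not taken from train.py or from DeepSeek's repository. The class name, hook event names, and JSON checkpoint format are all hypothetical, and a step counter stands in for the model and optimizer state.

```python
import json
import os
import tempfile

class TinyTrainer:
    """Hypothetical trainer: event hooks plus checkpoint resume."""

    def __init__(self, ckpt_path, total_steps):
        self.ckpt_path = ckpt_path
        self.total_steps = total_steps
        self.state = {"step": 0, "loss": None}
        self.hooks = {"on_step_end": [], "on_checkpoint": []}

    def add_hook(self, event, fn):
        self.hooks[event].append(fn)

    def _fire(self, event):
        for fn in self.hooks[event]:
            fn(self.state)

    def save(self):
        # Write to a temp file, then rename, so a crash cannot leave
        # a half-written checkpoint behind.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.ckpt_path)
        self._fire("on_checkpoint")

    def maybe_resume(self):
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.state = json.load(f)

    def train(self, stop_at=None, save_every=2):
        self.maybe_resume()
        end = self.total_steps if stop_at is None else min(stop_at, self.total_steps)
        while self.state["step"] < end:
            self.state["step"] += 1
            self.state["loss"] = 1.0 / self.state["step"]  # placeholder for a real loss
            self._fire("on_step_end")
            if self.state["step"] % save_every == 0:
                self.save()

ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
seen = []

first = TinyTrainer(ckpt, total_steps=6)
first.add_hook("on_step_end", lambda s: seen.append(s["step"]))
first.train(stop_at=3)   # simulate a run interrupted after step 3

second = TinyTrainer(ckpt, total_steps=6)
second.add_hook("on_step_end", lambda s: seen.append(s["step"]))
second.train()           # picks up from the last checkpoint (step 2)
```

After both runs, `seen` is `[1, 2, 3, 3, 4, 5, 6]`: step 3 is replayed because the last checkpoint was written at step 2. That replay is the usual trade-off between checkpoint frequency and lost work.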
Current status (honest):
Still chasing NaNs in the hybrid attention backward pass beyond 4096 tokens.
Masking logic and gradient flow are the remaining culprits.
Next 48 h: lock down correct causal padding masks, then push sequence length hard.
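For context, one common way a combined causal and padding mask produces NaNs (not necessarily the cause here) is a query row with no allowed keys, which happens for padding positions under left padding. Softmax over a row of all -inf then evaluates -inf minus -inf. The sketch below shows this in plain Python. The function names and the `safe` flag are assumptions for illustration, not the project's API.

```python
import math

NEG_INF = float("-inf")

def causal_padding_mask(is_real):
    """allowed[i][j] is True when query i may attend to key j.

    is_real: list of bools, False at padding positions.
    """
    n = len(is_real)
    return [[j <= i and is_real[j] for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed, safe=True):
    """Softmax over one row of scores, ignoring disallowed keys."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    top = max(masked)
    if top == NEG_INF and safe:
        # Every key is masked: return zeros instead of dividing NaN by NaN.
        return [0.0] * len(scores)
    # On the naive path with a fully masked row, m - top is -inf - (-inf) = NaN.
    exps = [math.exp(m - top) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Left-padded sequence: two padding tokens, then two real tokens.
mask = causal_padding_mask([False, False, True, True])
scores = [0.1, 0.2, 0.3, 0.4]

naive_row = masked_softmax(scores, mask[0], safe=False)  # padding query, naive path
safe_row = masked_softmax(scores, mask[0])               # padding query, guarded path
real_row = masked_softmax(scores, mask[3])               # last real token
```

Here `naive_row` is all NaN while `safe_row` is all zeros, and `real_row` is an ordinary distribution over the two real tokens. In a tensor framework the same guard is usually a check for fully masked rows, or a large finite negative fill value instead of -inf; whichever is used, the padded rows should also be excluded from the loss.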
Key lesson:
Sometimes the most valuable techniques aren’t hidden in the latest 1T monster; they live quietly in the overlooked 27B gems that few bother to study line by line. Everything is 100% custom-written and battle-tested, and the code will be public soon at:
github.com/RedAILabs/RED
Which lesser-hyped model taught you the most when you actually read its source? Reply below; I read every one.
Full technical write-up on LinkedIn (link in bio).
#LLMDev #BuildingInPublic #PyTorch #ProjectRed