Everyone talks model architecture but the real game is in the data trenches. Cleaning, mixing, synthesizing the right stuff decides if your model actually gets smarter or just louder. Most courses skip this because it’s messy and unglamorous.
The big dilemma with teaching an "LLM course" is that it is really easy to get drawn into teaching the various technical things: efficiency tricks, attention variants, PPO vs GRPO, etc. But the real "meat" is not there. It is in the data: data for pre-training, for mid-training, for SFT, for RL and for "reasoning"; synthetic data, curated data, annotated data; cleaning, evaluating, improving, mixing... lots of stuff.
But "data" is so much harder to teach: it is not "mathematical" or "algorithmic" like the technical things, and it is not clear what the teachable thing is. It is also a lot less transparent than the technical topics, both because it is semi-secret and because it is not appealing to publish, for roughly the same reasons it is not appealing to teach.
So, what would you teach about data? What are the key lessons and insights one should know? Any good papers or resources? Good existing classes? Blogs? Hit me with what you have.