I am excited to be giving a 4-hour tutorial on "Pretraining and Finetuning LLMs from the Ground Up" at the @SciPyConf conference in 5 days!
This tutorial is aimed at coders interested in understanding the building blocks of large language models (LLMs), how LLMs work, and how to code them from the ground up in PyTorch. After grasping how everything fits together and how to pretrain an LLM, we will learn how to load pretrained weights and finetune LLMs using open-source libraries.
I am currently putting the final touches on the code and will share it along with a reproducible environment soon (github.com/rasbt/LLM-worksho…).
The (I hope not too ambitious) schedule is as follows:
1) Introduction to LLMs: An overview of the workshop, covering what LLMs are, the topics we will discuss, and setup instructions.
2) Understanding LLM Input Data: In this section, we will code the text input pipeline by implementing a text tokenizer and a custom PyTorch DataLoader for our LLM.
3) Coding an LLM Architecture: We will go over the individual building blocks of LLMs and assemble them in code. We won't cover all modules in meticulous detail but will focus on the bigger picture and how to assemble them into a GPT-like model.
4) Pretraining LLMs: We will cover the pretraining process of LLMs and implement the code to pretrain the model architecture we created. Since pretraining is expensive, we will pretrain it only on a small public-domain text sample, just enough for the LLM to generate some basic sentences.
5) Loading Pretrained Weights: Because pretraining is lengthy and expensive, we will load pretrained weights into our self-implemented architecture. We will then introduce the LitGPT open-source library, which provides more sophisticated (but still readable) code for training and finetuning LLMs, and learn how to load the weights of pretrained LLMs (Llama, Phi, Gemma, Mistral) in LitGPT.
6) Finetuning LLMs: This section will introduce LLM finetuning techniques. We will prepare a small dataset for instruction finetuning, which we will then use to finetune an LLM in LitGPT.
I know I say this every year, but I am really excited to be returning to my favorite conference once more! It's going to be my fifth SciPy this year, and I am thrilled to see it at a new location (Tacoma/Seattle) this time!