GRPO, DPO, PPO, and RLHF are the methods behind every major LLM alignment pipeline. If you really want to understand how a base model becomes ChatGPT, you need to implement them yourself.
We just shipped a full RL track on TensorTonic covering all of it.
The track spans RLHF with KL penalties, DPO, GRPO, RLOO, the PPO clipped surrogate, Actor-Critic, REINFORCE, and GAE, all the way down to fundamentals like DQN variants, Q-Learning, SARSA, Monte Carlo methods, and multi-armed bandits.
tensortonic.com