Oh, to me it is the opposite. LLM RL is when you say supervised fine-tuning (SFT) instead of behavior cloning, RLVR instead of batch policy optimization,
base policy instead of behavior policy, trace instead of trajectory, verifiable reward instead of reward, ... LOL