While this is more or less the main RecSys heuristic, and it is hard to beat, I do think we should try to do better. Settling for a completely ad-hoc solution is not a long-term path to progress.
A simple way to replace RL with supervised learning: assign the eventual reward to every action on the path that led to it, then train a model to predict that reward for each action.
Hypothesis: no RL algorithm will ever beat this by much.
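To make the supervised reduction concrete, here is a minimal sketch of assigning the reward to every action on the path and then learning to predict it. It assumes each episode ends in a single scalar reward, applies no discounting, and uses a tabular mean as a stand-in for whatever supervised model one would really train. The function names, the state and action strings, and the toy episodes are all hypothetical.

```python
from collections import defaultdict


def label_episode(steps, reward):
    """Copy the episode's final reward onto every (state, action) step."""
    return [(state, action, reward) for state, action in steps]


def fit_mean_reward(examples):
    """Tabular stand-in for a regressor: mean reward per (state, action)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, action, reward in examples:
        totals[(state, action)] += reward
        counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def best_action(model, state, actions, default=0.0):
    """Pick the action with the highest predicted reward in this state."""
    return max(actions, key=lambda a: model.get((state, a), default))


# Toy logged episodes (hypothetical): a path of (state, action) pairs
# followed by the reward observed at the end of that path.
episodes = [
    ([("home", "show_a"), ("item_a", "show_b")], 1.0),
    ([("home", "show_a"), ("item_a", "show_c")], 0.0),
    ([("home", "show_b")], 0.0),
]

examples = [ex for steps, reward in episodes for ex in label_episode(steps, reward)]
model = fit_mean_reward(examples)
```

Calling `best_action(model, "home", ["show_a", "show_b"])` then selects the action whose past paths ended in the higher average reward. Nothing here bootstraps or propagates value between states, which is what separates this from a typical RL update.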