Since a number of people asked, this is what I mean by the deep learning way of understanding reinforcement learning.
It's all about one question: how can I differentiate through the total reward my agent is collecting?
The expected total reward (i.e., the value function) is the objective function of RL. When we see an objective function in deep learning, our goal is to differentiate through it to compute a gradient. You can view many RL techniques and fundamental problems as ways to fulfill that unstoppable desire to differentiate through the value function.
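In symbols, one common way to write this objective and the update we are after. The notation is my assumption for illustration and not something fixed above: a policy with parameters \theta, per-step reward r_t, a finite horizon T, and a step size \alpha.

```latex
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} r_t\right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

Everything that follows is a different way of getting at \nabla_\theta J(\theta).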
If the world your agent is operating in is differentiable, you can do it the easy way. If everything is deterministic or reparameterizable à la VAE, you have a complete computational graph. Just compute that gradient and push that value function up.
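Here is a minimal sketch of the easy way, assuming a toy 1-D world that I made up for illustration: the state moves by the action, the reward is the negative squared state, and the policy is a single gain `theta`. The tiny `Dual` class is a hand-rolled forward-mode autodiff, standing in for what a deep learning framework would do for you.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A number that carries its derivative w.r.t. one parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.val, -self.grad)


def lift(x):
    """Treat plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def total_reward(theta, s0=1.0, horizon=5):
    """Roll out the toy differentiable world and sum the rewards."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -theta * s              # policy: linear feedback with gain theta
        s = s + a                   # known, differentiable dynamics
        total = total + (-(s * s))  # reward: stay close to zero
    return total


theta = 0.2
initial_return = total_reward(theta)
for _ in range(100):
    out = total_reward(Dual(theta, 1.0))  # seed d(theta)/d(theta) = 1
    theta += 0.05 * out.grad              # gradient ascent on the total reward
final_return = total_reward(theta)
print(theta, initial_return, final_return)
```

The whole rollout is one computational graph, so the gradient of the total reward with respect to `theta` falls out of a single pass through it.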
What if you don't have access to the world dynamics? You still want to compute that gradient. Learn a world model from data. Rolling out the learned model gives you a valid computational graph, and that graph gives you a gradient to push that value function up.
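A rough sketch of that idea, under heavy assumptions: the hidden dynamics `true_step`, the linear model with weights `w_s` and `w_a`, the one-step reward, and the policy `a = theta * s` are all invented for illustration. The agent only samples the world, fits a model, and then applies the chain rule through the model instead of the world.

```python
import random

random.seed(0)


def true_step(s, a):
    """The real world. The agent can sample it but cannot differentiate it."""
    return 0.9 * s + 0.5 * a


# 1) Collect transitions with random actions.
data = []
for _ in range(200):
    s, a = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append((s, a, true_step(s, a)))

# 2) Fit a world model s_next ~ w_s * s + w_a * a by SGD on squared error.
w_s, w_a = 0.0, 0.0
for _ in range(50):
    for s, a, s_next in data:
        err = (w_s * s + w_a * a) - s_next
        w_s -= 0.1 * err * s
        w_a -= 0.1 * err * a

# 3) Differentiate the reward through the learned model to improve the policy.
theta, s0 = 0.0, 1.0                 # policy: a = theta * s
for _ in range(100):
    a = theta * s0
    s_pred = w_s * s0 + w_a * a      # forward pass through the model
    # reward = -s_pred**2; chain rule: dr/ds_pred * ds_pred/da * da/dtheta
    grad = (-2.0 * s_pred) * w_a * s0
    theta += 0.5 * grad              # gradient ascent on the predicted reward

print(w_s, w_a, theta, true_step(s0, theta * s0))
```

This is a one-step version. With longer rollouts the model's errors feed into each other, which is the compounding-error problem.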
Your world model doesn't work as expected, or it's too computationally expensive? You can predict the value directly with a critic (a value function approximator) and differentiate through it. With a critic, you need just one forward pass and one backward pass. Direct gradient to your agent. No compounding errors, no messed up gradients, nothing too expensive.
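To make the critic route concrete, here is a small sketch under assumptions of my own: a single-state problem, a black-box `reward` function, a quadratic critic with weights `w`, and an actor that always outputs `a = theta`. The critic is fit by plain regression on observed rewards, then the actor follows the critic's gradient with respect to the action.

```python
import random

random.seed(0)


def reward(a):
    """Black-box reward; the agent only observes samples of it."""
    return -(a - 0.5) ** 2


# 1) Fit a critic Q(a) = w0 + w1*a + w2*a^2 by regression on observed rewards.
samples = [random.uniform(-1, 1) for _ in range(200)]
w = [0.0, 0.0, 0.0]
for _ in range(100):
    for a in samples:
        feats = [1.0, a, a * a]
        err = sum(wi * fi for wi, fi in zip(w, feats)) - reward(a)
        w = [wi - 0.1 * err * fi for wi, fi in zip(w, feats)]

# 2) Improve the actor by differentiating the critic w.r.t. the action.
theta = -0.8                            # actor: always outputs a = theta
for _ in range(100):
    dq_da = w[1] + 2.0 * w[2] * theta   # backward pass through the critic
    theta += 0.1 * dq_da                # da/dtheta = 1 for this actor

print(w, theta)
```

No rollout is involved in the actor update: one evaluation of the critic and one derivative is all it takes.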
Learning a critic by just predicting the rewards you observe requires too much data? You can use temporal difference learning, a nice concept that exploits the structure of decision-making to speed up learning the critic. Then you will be able to compute your gradient.
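A tiny tabular TD(0) sketch, assuming a made-up deterministic chain of five states with a reward of 1 for reaching the end, plus a discount `GAMMA` and step size `ALPHA` that I picked arbitrarily. The point is the update rule: each state's estimate is pulled toward the reward plus the discounted estimate of the next state, instead of waiting for the full return.

```python
GAMMA, ALPHA = 0.9, 0.5
N_STATES = 5                  # states 0..4; state 4 is terminal


def step(s):
    """Deterministic chain: move right, reward 1 on reaching the terminal state."""
    s_next = s + 1
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r


V = [0.0] * N_STATES          # tabular critic; the terminal value stays 0
for _ in range(200):          # episodes
    s = 0
    while s != N_STATES - 1:
        s_next, r = step(s)
        td_target = r + GAMMA * V[s_next]     # bootstrap from the next state's estimate
        V[s] += ALPHA * (td_target - V[s])    # TD(0) update
        s = s_next

# True values for this chain: V[3] = 1, V[2] = 0.9, V[1] = 0.81, V[0] = 0.729.
print([round(v, 3) for v in V])
```

In this toy chain there is no noise, so TD's data efficiency doesn't really show. The structure it exploits is visible in the update, though: the value of a state is defined in terms of the value of the state that follows it.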
You can understand a lot of reinforcement learning as just a quest to compute the gradient of the value function. That's not all of it, but I always find it useful to think about it this way.