Perhaps surprisingly, taking a KL estimate as the `kl_loss` to minimize does *not* enforce the intended KL penalty.
This implementation, however, is quite common in open-source RL repos and recent research papers.
In short: the gradient of an unbiased KL estimate is not an unbiased estimate of the KL gradient.
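To make this concrete, here is a minimal sketch under assumptions not in the text above: a 3-way softmax policy with made-up logits `theta`, a made-up reference distribution `q`, and the naive log-ratio estimate `log p(x) - log q(x)` with `x` sampled from the policy `p`. Expectations are computed exactly by enumeration, so no sampling noise is involved. It compares the expected gradient you get by differentiating the estimate with the samples held fixed against the actual gradient of KL(p || q).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(theta, q):
    # Exact KL(p_theta || q) for a categorical softmax policy.
    p = softmax(theta)
    return float(np.sum(p * (np.log(p) - np.log(q))))

theta = np.array([0.5, -0.3, 1.2])  # hypothetical policy logits
q = np.array([0.2, 0.5, 0.3])       # hypothetical reference distribution

p = softmax(theta)
log_ratio = np.log(p) - np.log(q)   # per-outcome estimate; E_p[log_ratio] = KL

# Row i holds grad_theta log p(i) = e_i - p for a softmax policy.
grad_log_p = np.eye(len(p)) - p

# Differentiating the estimate with samples held fixed gives grad log p(x);
# its expectation under p is the expected gradient of the naive kl_loss.
naive_grad = p @ grad_log_p

# Gradient of the KL itself: E_p[log_ratio(x) * grad log p(x)].
true_grad = (p * log_ratio) @ grad_log_p

# Central finite differences on the exact KL, as an independent check.
eps = 1e-6
fd_grad = np.array([
    (kl(theta + eps * e, q) - kl(theta - eps * e, q)) / (2 * eps)
    for e in np.eye(len(theta))
])

print("naive expected grad:", naive_grad)
print("true KL grad:       ", true_grad)
print("finite-diff KL grad:", fd_grad)
```

In this toy setup the naive expected gradient vanishes (up to floating-point error) while the true KL gradient does not, so on average the naive `kl_loss` does not push the policy toward the reference at all. Other estimators fail differently; this only illustrates the simplest case.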