In the word_language_model example, after every `bptt` tokens the hidden state `h` is detached.
I understand that this is needed to prevent gradients from flowing back to the start of the sequence. But `detach`, in turn, resets `requires_grad` to `False`. So from the second `bptt` chunk onwards, won't gradients stop being computed for `h`? I think we do need gradients for `h` throughout training. Please clarify this.
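To make the setup concrete, here is a minimal sketch of the truncated-BPTT pattern in question (not the example's exact code; the RNN, shapes, and dummy data are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the language model: a plain RNN.
rnn = nn.RNN(input_size=4, hidden_size=8)

h = torch.zeros(1, 1, 8)          # initial hidden state
data = torch.randn(3, 2, 1, 4)    # 3 bptt chunks of 2 time steps each

for chunk in data:
    h = h.detach()                # cut the graph at the chunk boundary;
                                  # here h.requires_grad is False
    out, h = rnn(chunk, h)        # the new h is produced by differentiable
                                  # ops on rnn's parameters, so it
                                  # requires grad again within this chunk
    loss = out.sum()              # dummy loss for illustration
    loss.backward()               # gradients flow back only to the most
                                  # recent detach, not to earlier chunks
    rnn.zero_grad()
```

Detaching makes `h` a leaf (no history) at the start of each chunk, but as soon as it passes through the RNN it is again part of a graph that requires grad, so gradients are still computed for it inside each chunk.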