Word_language_model example and detach


(Ratish Puduppully) #1

Hi,
In the word_language_model example, after every bptt tokens the hidden state h is detached:

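If I remember the code correctly, the relevant helper (repackage_hidden in main.py) looks roughly like this; I'm paraphrasing from memory, so the version in the repo may differ slightly:

```python
import torch

def repackage_hidden(h):
    """Wrap hidden states in new Tensors, detaching them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

# ... and in the training loop, at the start of every bptt chunk:
#     hidden = repackage_hidden(hidden)
#     output, hidden = model(data, hidden)
```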

I understand that this is needed to prevent gradients from flowing all the way back to the start of the sequence. But detach, in turn, resets requires_grad to False, so from the second bptt chunk onwards no gradients will be computed for h. I think we do need gradients with respect to h throughout. Please clarify this.
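A minimal standalone sketch of what I mean (not code from the example, just an illustration of the detach behaviour I'm asking about):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=4)
x = torch.randn(5, 1, 4)   # (seq_len, batch, input_size)
h = torch.zeros(1, 1, 4)   # initial hidden state

out, h = rnn(x, h)
print(h.requires_grad)     # True: h is part of the autograd graph

h = h.detach()             # as done between bptt chunks
print(h.requires_grad)     # False: no gradient will be computed for this h

out, h = rnn(x, h)         # next chunk starts from the detached h
```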

Thanks.