Word_language_model example and detach

ratishsp · December 31, 2018, 12:19pm

Hi,
In the word_language_model example, the hidden state h is detached after every bptt tokens:

pytorch/examples/blob/e0929a4253f9ae6ccdde24e787788a9955fdfe1c/word_language_model/main.py#L106




criterion = nn.CrossEntropyLoss()


###############################################################################
# Training code
###############################################################################


def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)




# get_batch subdivides the source data into chunks of length args.bptt.
# If source is equal to the example output of the batchify function, with
# a bptt-limit of 2, we'd get the following two Variables for i = 0:
# ┌ a g m s ┐ ┌ b h n t ┐
# └ b h n t ┘ └ c i o u ┘
# Note that despite the name of the function, the subdivision of data is not

I understand that this is needed to prevent gradients from flowing back to the start of the sequence. But detach, in turn, returns a tensor with requires_grad set to False. So from the second bptt chunk onwards, gradients will not be computed for h. I think we do need gradients for h throughout. Could someone please clarify this?
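
To make the question concrete, here is a minimal sketch of the behaviour I am asking about. It is not the model from the example: the small nn.LSTM, the sizes, and the random inputs are assumptions chosen only for illustration. It records requires_grad on h right after detaching and again after the next forward pass, and checks whether the recurrent weights still receive a gradient in the second chunk.

```python
import torch
import torch.nn as nn


def repackage_hidden(h):
    """Detach hidden states from their history (same logic as in the example)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)


torch.manual_seed(0)

# Hypothetical sizes, not taken from the word_language_model example.
input_size, hidden_size, bptt, batch_size = 4, 3, 5, 2
rnn = nn.LSTM(input_size=input_size, hidden_size=hidden_size)

# Initial (h, c) for a single-layer LSTM.
hidden = (
    torch.zeros(1, batch_size, hidden_size),
    torch.zeros(1, batch_size, hidden_size),
)

flags = []
for chunk in range(2):
    x = torch.randn(bptt, batch_size, input_size)  # stand-in for one bptt chunk

    # Cut the graph at the chunk boundary, as the training loop does.
    hidden = repackage_hidden(hidden)
    after_detach = hidden[0].requires_grad

    output, hidden = rnn(x, hidden)
    after_forward = hidden[0].requires_grad

    rnn.zero_grad()
    output.sum().backward()  # dummy loss, only to trigger backward

    flags.append((after_detach, after_forward))
    print(f"chunk {chunk}: after detach={after_detach}, after forward={after_forward}")

# Gradient of the recurrent weights computed in the second chunk.
weight_grad = rnn.weight_hh_l0.grad
print("weight_hh_l0 has grad:", weight_grad is not None)
```

In this sketch the detached h is only the input to the next chunk; the h produced inside the chunk is computed from the LSTM weights, so the flags printed after the forward pass are the ones that matter for the backward call.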

Thanks.