Help clarifying repackage_hidden in word_language_model

Hi,
In the example of word_language_model, we have

def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data)
    else:
        return tuple(repackage_hidden(v) for v in h)

I don't think I fully understand what the “history” includes. Can somebody help clarify this?

Thanks!

Every Variable has a .creator attribute, which is an entry point to a graph that encodes its operation history. This allows autograd to replay the graph and differentiate each op. So each hidden state holds a reference to the graph node that created it, but in that example you’re doing BPTT, so you never want to backprop through it once you have finished the sequence. To get rid of the reference, you take out the tensor containing the hidden state (h.data) and wrap it in a fresh Variable that has no history (it is a graph leaf). This lets the previous graph go out of scope and frees its memory for the next iteration.
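
To make the "history" concrete, here is a minimal sketch using the current API mentioned later in this thread, where Variable has been merged into Tensor and .creator is called grad_fn. The tensor names `w`, `h` and `h_detached` are made up for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)  # a leaf tensor that requires grad
h = w * 2                              # result of an op, so it carries history

h_detached = h.detach()                # same values, but cut off from the graph

print(h.grad_fn is None)               # False: h points at the node that created it
print(h_detached.grad_fn is None)      # True: no history left
print(h_detached.requires_grad)        # False
print(h_detached.is_leaf)              # True: it is now a graph leaf
```

Once nothing refers to `h` any more, the graph behind it can be garbage collected, which is the memory saving described above.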

I was going to add that .detach() does the same thing, but I checked the code and realized that I’m not at all sure about the semantics of var2 = var1.detach() vs var2 = Variable(var1.data)

Right now the difference is that .detach() still retains the reference, but it should be fixed.

It will change once more when we add lazy execution. In eager mode, it will stay as is (always discard the .creator and mark as not requiring grad). In lazy mode var1.detach() won’t trigger the compute and will save the reference, while Variable(var1.data) will trigger it, because you’re accessing the data.

So we do not need to repackage the hidden state when making predictions, since we don’t do BPTT?

For any latecomers: the Variable object no longer has a creator attribute; it has been renamed to grad_fn. You can see here for more information.

Shouldn’t the code set requires_grad=True on the hidden state, as shown below?
As I understand it, each BPTT step should be able to compute gradients for h.

def repackage_hidden(h):
    """Wraps hidden states in new Variables, to detach them from their history."""
    if type(h) == Variable:
        return Variable(h.data, requires_grad=True)
    else:
        return tuple(repackage_hidden(v) for v in h)

Thanks.
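
For what it's worth, a quick check suggests that requires_grad=True on the hidden state only matters if you want the gradient with respect to the hidden state itself. The sketch below uses a toy nn.RNN with arbitrary sizes (not the model from the example): the parameters still receive gradients even though the initial hidden state `h0` does not require grad.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=4)  # toy sizes, chosen only for illustration

h0 = torch.zeros(1, 1, 4)   # stands in for a detached hidden state (requires_grad is False)
x = torch.randn(2, 1, 3)    # (seq_len, batch, features)

out, h1 = rnn(x, h0)
out.sum().backward()        # placeholder loss, just to trigger backprop

params_have_grad = all(p.grad is not None for p in rnn.parameters())
print(params_have_grad)     # True: the weights are still trained
print(h0.grad)              # None: no gradient is kept for the hidden state itself
```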

It has already been updated to be compatible with the latest PyTorch version:

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
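
Here is a rough sketch of how this function might sit in a training loop. It is not the loop from the example itself: the toy nn.LSTM, the sizes, the random input chunks and the placeholder loss are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def repackage_hidden(h):
    """Detach hidden states from their history (same logic as above)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8)       # toy model, sizes are arbitrary
opt = torch.optim.SGD(lstm.parameters(), lr=0.1)

# (h_0, c_0), each of shape (num_layers, batch, hidden_size)
hidden = (torch.zeros(1, 2, 8), torch.zeros(1, 2, 8))

for step in range(3):
    chunk = torch.randn(5, 2, 4)        # one chunk: (seq_len, batch, features)
    hidden = repackage_hidden(hidden)   # keep the values, drop the old graph
    opt.zero_grad()
    out, hidden = lstm(chunk, hidden)
    loss = out.pow(2).mean()            # placeholder loss, only for the sketch
    loss.backward()                     # backprop stops at the start of this chunk
    opt.step()
```

The hidden state values carry over from chunk to chunk, but each backward() call only walks the graph built for the current chunk.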