[solved] Why do we need to detach a Variable which contains the hidden representation?

I was going through the official PyTorch example “word_language_model” and found the following lines of code in the train() function.

# Starting each batch, we detach the hidden state from how it was previously produced.
# If we didn't, the model would try backpropagating all the way to start of the dataset.
hidden = repackage_hidden(hidden)

I do not understand why we need to detach the hidden variable from the hidden variable associated with the previous batch of input. When gradients are computed and the loss is backpropagated, it is usually the weight matrices that are affected by the chain of computation.

Why are the hidden variables, which represent the hidden states of a recurrent neural network, at risk of being affected during backpropagation, and why should they be detached from their previous value?

When should the hidden variable be detached from its previous value? At the beginning of each batch, or at the beginning of each training epoch?


This is for doing truncated backpropagation through time (BPTT). The dependency graph of an RNN can be viewed simply as follows.

c1, h1 -> c2, h2 -> c3, h3 -> ... -> cn, hn
    |        |         |                |
  loss1    loss2     loss3            lossn

If we did not truncate the history of hidden states (c, h), the back-propagated gradients would flow from each loss all the way back to the beginning of the sequence, which may result in vanishing or exploding gradients. A detailed explanation can be found here.


Why are the hidden variables, which represent the hidden states of a recurrent neural network, at risk of being affected during backpropagation, and why should they be detached from their previous value?

As in the diagram below, the error E3 depends on hidden state s3, which depends on s2, which depends on s1, and so on. At timestep t=100, to compute the gradients we would have to consider the last 100 states.

We don’t want that: we want to assume that for some k, the hidden state s[t-k] is constant and does not depend on anything.

To achieve this we take a batch of k words and simply assume that the initial hidden state for this batch is constant and doesn’t depend on anything.

But in PyTorch, if you call hidden = model(batch, hidden), all of those hidden states are connected. So at the start of each batch you have to manually tell PyTorch: “here’s the hidden state from the previous batch, but consider it constant”.

I believe you could simply call hidden.detach_() though; there is no need to create a new Variable. (@smth ?)
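
For readers on current PyTorch versions, here is a minimal sketch of the idea above. It is written against the tensor API rather than the Variable API used in this thread, and the toy model, sizes, random data and loss are all assumptions made up for illustration, not the code from the word_language_model example.

```python
import torch
import torch.nn as nn

def repackage_hidden(h):
    """Return the same values, cut off from the graph that produced them."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    # LSTM states come as a tuple (h, c); detach each element.
    return tuple(repackage_hidden(v) for v in h)

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=8)   # toy sizes, chosen arbitrarily
head = nn.Linear(8, 1)                        # hypothetical output layer
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

data = torch.randn(30, 2, 4)   # (seq_len, batch, features), random stand-in data
bptt = 10                      # k: how many steps each backward pass may span
hidden = (torch.zeros(1, 2, 8), torch.zeros(1, 2, 8))

for i in range(0, data.size(0), bptt):
    chunk = data[i:i + bptt]
    # Keep the values from the previous chunk, but treat them as constants.
    hidden = repackage_hidden(hidden)
    optimizer.zero_grad()
    output, hidden = lstm(chunk, hidden)
    loss = head(output).pow(2).mean()   # placeholder loss for the sketch
    loss.backward()                     # stops at the start of this chunk
    optimizer.step()
```

If the repackage_hidden call is dropped, the second backward() would have to walk back into the first chunk's graph, which has normally been freed by then, so PyTorch typically raises a RuntimeError instead of silently backpropagating further.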

When should the hidden variable be detached from its previous value? At the beginning of each batch?


Image source and a nice blog post about backprop through time: LINK

EDIT: Oops, I see @chenzhekl had already answered that while I was writing. Sorry for duplicating.


Hello, thanks for the nice answer, it explains a lot! But I still have a question:

Although in the example given by @wasiahmad the hidden states are detached in train() (using repackage_hidden()), in some examples of nn.LSTM it is simply used as:

def forward(self, input):
    outputs, (h_n, c_n) = self.lstm(input)
    return outputs

where input is a tensor of sequences (or a packed sequence), and the initial hidden states h_0, c_0 are initialized automatically.

In the example above, the hidden states are not detached manually. I am wondering whether the detachment is done automatically here, or whether BPTT is computed based on all previous hidden states without any detachment. Thanks.


In your examples, new zero-initialized hidden states are created at every call to the LSTM, and therefore they are not connected to the previous sequence in any way. The language modeling example needs to detach them, because it retains the values of the hidden states between training steps (but doesn’t want to backprop very far back).
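
A small check of this behaviour, as a sketch with arbitrary toy sizes and random input (none of it taken from the thread): calling nn.LSTM without a hidden state should give the same output as passing explicit zero states, which is why nothing carries over between calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=5)   # toy sizes, chosen arbitrarily
x = torch.randn(7, 2, 3)                      # (seq_len, batch, features)

# No hidden state passed: the module starts from zeros on every call.
out_default, _ = lstm(x)

# Passing zero states explicitly should be equivalent.
zeros = (torch.zeros(1, 2, 5), torch.zeros(1, 2, 5))
out_explicit, _ = lstm(x, zeros)

same = torch.allclose(out_default, out_explicit)
print(same)
```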


Understood. Many thanks.

Hello, I have a specific question about the detachment op we are discussing here.

For example, we have a seq2seq model with an attention layer between the encoder and the decoder. According to the common implementations of attention models, the last hidden state of the encoder (say, ‘hn’) is used as the first hidden state of the decoder.

My question is: is it necessary to detach hn from the encoder network?

It depends what you want to do. If you detach it, the encoder won’t get any gradients from that backward.
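
To make that concrete, here is a hedged sketch using two plain GRUs as stand-ins for the encoder and decoder. There is no attention layer here (with attention, the encoder outputs would provide another gradient path), and the helper name encoder_got_gradients, the sizes and the dummy loss are all made up for the illustration.

```python
import torch
import torch.nn as nn

def encoder_got_gradients(detach):
    """Run one encoder -> decoder backward pass and report whether any
    encoder parameter received a gradient."""
    torch.manual_seed(0)
    encoder = nn.GRU(input_size=3, hidden_size=5)
    decoder = nn.GRU(input_size=3, hidden_size=5)
    src = torch.randn(6, 2, 3)   # (src_len, batch, features)
    tgt = torch.randn(4, 2, 3)   # (tgt_len, batch, features)

    _, hn = encoder(src)         # last hidden state of the encoder
    if detach:
        hn = hn.detach()         # cut the link back into the encoder
    out, _ = decoder(tgt, hn)    # hn is the decoder's first hidden state
    out.sum().backward()         # dummy loss, just to trigger backward
    return any(p.grad is not None for p in encoder.parameters())

print(encoder_got_gradients(detach=False))
print(encoder_got_gradients(detach=True))
```

In this setup the encoder is trained through the decoder's loss only when hn is left attached.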

Then I suppose detaching the hidden variable from the graph during the evaluation stage is not necessary, or is it?

def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    for i in range(0, data_source.size(0) - 1, args.bptt):
        data, targets = get_batch(data_source, i, evaluation=True)
        output, hidden = model(data, hidden)
        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / len(data_source)

source: https://github.com/pytorch/examples/blob/master/word_language_model/main.py

During evaluation detaching is not necessary. When you evaluate there is no need to compute the gradients nor backpropagate anything.

So, afaik, just mark your input Variable as volatile and PyTorch won’t bother to create the backpropagation graph; it will just do a forward pass.
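
Note for later readers: the volatile flag was removed in PyTorch 0.4, and torch.no_grad() is its replacement. The following is a sketch of an evaluation loop in that style, with a toy LSTM and random data standing in for the real model and corpus (both are assumptions for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=3, hidden_size=5)   # toy stand-in for the real model
rnn.eval()                                   # evaluation mode (disables dropout)

data = torch.randn(20, 2, 3)   # (seq_len, batch, features), random stand-in
bptt = 5
hidden = None                  # None lets the LSTM start from zero states

with torch.no_grad():          # no graph is recorded inside this block
    for i in range(0, data.size(0), bptt):
        # The hidden state is carried across chunks without any detach call,
        # because there is no graph to detach it from.
        output, hidden = rnn(data[i:i + bptt], hidden)

print(output.requires_grad, hidden[0].grad_fn)
```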

hi Adam,

How do we detach the states from the output of a GRU layer?

def detach_states(self, states):
	if states is None:
		return states
	return [state.detach() for state in states]

Here, states are the output from the GRU layer. When I pass them to

self.gru(X, initial_states)

It gives me the following error: …

File "/Users/user_name/anaconda/lib/python3.5/site-packages/torch/nn/modules/rnn.py", line 162, in check_forward_args
check_hidden_size(hidden, expected_hidden_size)

File "/Users/user_name/anaconda/lib/python3.5/site-packages/torch/nn/modules/rnn.py", line 153, in check_hidden_size
if tuple(hx.size()) != expected_hidden_size:
AttributeError: 'list' object has no attribute 'size'

What am I doing wrong? I can post the entire code if this is not clear.
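
One possible reading of the traceback, offered as a sketch rather than a diagnosis of the full code (which is not shown): nn.GRU takes its initial hidden state as a single tensor, while the helper above returns a Python list, and a list has no size(). Iterating over a tensor also splits it along its first dimension, so the list comprehension changes the type even when a tensor is passed in. A variant of the helper that preserves the type could look like this; the toy sizes are arbitrary.

```python
import torch
import torch.nn as nn

def detach_states(states):
    """Detach a hidden state while keeping its type: a tensor stays a tensor
    (GRU), a tuple of tensors stays a tuple (LSTM)."""
    if states is None:
        return None
    if isinstance(states, torch.Tensor):
        return states.detach()
    return tuple(detach_states(s) for s in states)

torch.manual_seed(0)
gru = nn.GRU(input_size=3, hidden_size=5)   # toy sizes, chosen arbitrarily
x = torch.randn(4, 2, 3)                    # (seq_len, batch, features)

_, h = gru(x)             # h is one tensor of shape (num_layers, batch, hidden)
h = detach_states(h)      # still a tensor, now cut off from the graph
out, h_next = gru(x, h)   # accepted as the initial hidden state
```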

In fact, if you call


at the beginning of each batch during training, then there will be no need to call



I honestly feel a bit disappointed by such duct-taping. Why does a user need to know about repackage_hidden(h), or even keep it in their code? This implementation of recurrence is specific to PyTorch and is just confusing, considering that neither TensorFlow nor even the older Torch ever exposes it.

I would say it depends on your use case and maybe your workflow.
E.g. how would you like to expose the functionality of:

  • backpropagating through all seen data (i.e. in PyTorch just don’t detach the hidden state)
  • use only the last input batch?

New proposals for these use cases (and UX) are always welcome. :)

Hi there! I would like to know if the first case in your comment is at all possible in PyTorch? I’ve just created a thread on a very similar use case: How to backpropagate a loss through time-series RNN?, and I was wondering whether one can actually perform online training with RNNs where the hidden states are never detached (i.e. backpropagating through all seen data) without stumbling upon autograd errors. Thank you!