Judging by the error report, this happens in a GRU model, specifically in the GRU layer as it processes the hidden input.
The GRU layer takes two inputs: the data (the output of an embedding layer, of size [sequence length, batch size, embedding features]) and the hidden state from the previous step, of size [1, batch size, hidden features].
output, hidden = self.gru(x, hidden)
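For context, the relevant part of the encoder looks roughly like this (simplified; the layer names and sizes here are illustrative, not my exact code):

```python
import torch.nn as nn

class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, emb_size=128, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size)  # batch_first=False: [seq_len, batch, features]

    def forward(self, tokens, hidden):
        # tokens: [seq_len, batch]; hidden: [1, batch, hidden_size]
        x = self.embedding(tokens)            # [seq_len, batch, emb_size]
        output, hidden = self.gru(x, hidden)  # the line the error points at
        return output, hidden
```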
The error stack trace is:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [21, 3, 256]], which is output 0 of CudnnRnnBackward, is at version 1;
expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The size of the float tensor points to the hidden state of the GRU. I definitely didn't apply an in-place operation to it anywhere; the other tensors and variables seem fine.
The error refers to the third line in the forward method, because [torch.cuda.FloatTensor [21, 3, 256]] refers to the hidden state (sequence length: 21, batch size: 3, number of features: 256), not the embedding vector. Where do I modify hidden in place?
Also, looking at the error report, I’m not sure I understand what sort of ‘version’ the exception refers to: ‘is at version 1, expected version 0’. What does this mean?
Versions are how autograd keeps track of in-place operations: every time a tensor is modified in place, its version counter is incremented by 1. When a tensor is saved for the backward pass, autograd records its version at that moment, and backward() raises this error if the version has changed since then, which is exactly the "is at version 1; expected version 0" in your message.
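You can watch the counter through the internal _version attribute (not part of the stable API, so only useful for poking around):

```python
import torch

t = torch.zeros(2, 3)
print(t._version)  # 0

t.add_(1)          # in-place op: bumps the version counter
print(t._version)  # 1

u = t + 1          # out-of-place op: returns a fresh tensor at version 0
print(u._version)  # 0
```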
You might be modifying output or hidden in place somewhere later in your forward pass, after they are returned by the encoder GRU.
Right! It was modified (unsqueeze) before being fed into the decoder, and I forgot to remove the trailing _ that marks the in-place variant! So given all these problems autograd has with in-place operations, when can I actually use them? Just for dataset manipulation?
Well, unsqueeze_ is definitely one of the more useful in-place operations, and if it is not allowed here, I'm not sure when I could use it. The tensor I had a problem with was not the hidden state, as it seemed from the error stack trace: it was the output of the encoder, the full history of the forward pass, of size sequence_length x batch_size x hidden_state_size.
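For anyone who hits the same thing, here is a minimal sketch of the broken pattern and the fix (the names and sizes are made up to match the shapes from the error, not my actual code):

```python
import torch
import torch.nn as nn

gru = nn.GRU(128, 256)
x = torch.randn(21, 3, 128, requires_grad=True)
encoder_output, hidden = gru(x)   # encoder_output: [21, 3, 256], saved for the backward pass

# Broken (what I had): unsqueeze_ mutates encoder_output in place, bumping its
# version counter, and backward() then fails because the RNN backward saved
# that tensor at version 0.
# decoder_input = encoder_output.unsqueeze_(0)

# Fixed: the out-of-place unsqueeze returns a new view and leaves the saved
# activation at version 0.
decoder_input = encoder_output.unsqueeze(0)

loss = decoder_input.sum()
loss.backward()                   # works with the out-of-place version
```

In general, in-place operations are fine on tensors that autograd has not saved for the backward pass: preprocessing data before it enters the graph, code under torch.no_grad() (this is how optimizers update weights in place), and similar.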