Strange error: one of the variables needed for gradient computation has been modified by an inplace operation

Hi, everyone. I’m training a transformer for my research work. Here’s the ScaledDotAttention module:

    def forward(self, Q, K, V, d_k, attn_mask=None):
        # (batch_size, n_heads, seq_len, d_k) x (batch_size, n_heads, d_k, seq_len)
        # -> (batch_size, n_heads, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)
        if attn_mask is not None:
            assert scores.shape == attn_mask.shape
            scores.masked_fill_(attn_mask, -1e20)
        attns = self.dropout(torch.softmax(scores, dim=-1))
        # if attn_mask is not None:
        #     attns.masked_fill_(attn_mask, 0.0)
        # (batch_size, n_heads, seq_len, seq_len) x (batch_size, n_heads, seq_len, d_v)
        # -> (batch_size, n_heads, seq_len, d_v)
        context = torch.matmul(attns, V)
        return context, attns

The code above works just fine. However, when I uncomment attns.masked_fill_(attn_mask, 0.0), I get this error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.

Filling the attentions with 0.0 via attn_mask isn't necessary, since the corresponding scores have already been masked. But I'm wondering why adding such an operation causes this error. Does anyone have an idea? Thanks!

ps: There's too much code in this project, and I can't post all of it here or create a minimal error-reproduction snippet. Sorry about that.

btw, I'm 100% sure that line is where the error stems from.

Any method or function in PyTorch that ends in an underscore is the in-place version of that operation. For example, multiplication can be done via mul or its in-place equivalent mul_.
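To make the convention concrete, here's a tiny standalone example (just an illustration, not taken from your code):

    import torch

    a = torch.ones(3)
    b = a.mul(2.0)   # out-of-place: returns a new tensor, `a` is left untouched
    a.mul_(2.0)      # in-place: modifies `a` itself (note the trailing underscore)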

Thanks for your reply. I know that *_ operations are in-place, but what bothers me is why this in-place operation leads to this strange error, since it happens in the forward pass while loss.backward() happens in the backward pass.

Because you can’t differentiate in-place operations

I don't think that's why this error occurs. There are lots of in-place operations in the Transformer, such as the line above, scores.masked_fill_(attn_mask, -1e20), but none of them triggers this error. Is there something I've missed?
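For example, a minimal sketch of the kind of pattern I mean (made-up tensors, not my actual code) backprops without any complaint:

    import torch

    x = torch.randn(4, requires_grad=True)
    s = x * 2.0                               # backward of a multiply-by-constant doesn't need `s`
    s.masked_fill_(s < 0.0, -1e20)            # in-place fill, analogous to masking the scores
    torch.softmax(s, dim=0).sum().backward()  # runs fine, no RuntimeError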

You would have to check whether the manipulated tensor is needed in its original form for the gradient calculation; if it is, in-place operations on it are disallowed.
This post shows you a simple example.
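For instance, a minimal standalone sketch of the same failure mode (made-up tensors, assuming the in-place write hits a tensor that autograd saved for backward, as the softmax output is here):

    import torch

    x = torch.randn(4, requires_grad=True)
    attn = torch.softmax(x, dim=0)        # softmax saves its *output* for the backward pass
    attn.masked_fill_(attn < 0.25, 0.0)   # in-place edit of that saved output
    attn.sum().backward()                 # RuntimeError: ... modified by an inplace operation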
