Hi, everyone. I’m training a transformer for my research work. Here’s the ScaledDotAttention
module:
def forward(self, Q, K, V, d_k, attn_mask=None):
    # scores: (batch_size, n_heads, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)
    if attn_mask is not None:
        assert scores.shape == attn_mask.shape
        scores.masked_fill_(attn_mask, -1e20)
    attns = self.dropout(torch.softmax(scores, dim=-1))
    # if attn_mask is not None:
    #     attns.masked_fill_(attn_mask, 0.0)
    # (..., seq_len, seq_len) x (..., seq_len, d_v) -> (batch_size, n_heads, seq_len, d_v)
    context = torch.matmul(attns, V)
    return context, attns
The above code works just fine. However, when I uncomment attns.masked_fill_(attn_mask, 0.0), an error occurs: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
Masking the attentions with 0.0 using attn_mask isn't necessary, since the corresponding scores have already been masked before the softmax. But I'm wondering why adding such an operation causes this error. Does anyone have an idea? Thanks!
ps: There's too much code in this project, so I can't post all of it here or create a minimal error-reproducing snippet. Sorry about that.
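The closest I can offer is a generic stand-alone sketch of the pattern I suspect (made-up shapes, plain tensors instead of my actual module): the in-place masked_fill_ on the softmax output seems to be what triggers the same error, since softmax saves its output for the backward pass.

```python
import torch

# Stand-alone sketch with made-up shapes (batch=2, seq_len=4, d_k=d_v=8),
# not my actual module.
torch.manual_seed(0)
Q = torch.randn(2, 4, 8, requires_grad=True)
K = torch.randn(2, 4, 8, requires_grad=True)
V = torch.randn(2, 4, 8, requires_grad=True)
attn_mask = torch.zeros(2, 4, 4, dtype=torch.bool)
attn_mask[:, :, -1] = True  # mask out the last key position

scores = torch.matmul(Q, K.transpose(-1, -2)) / 8 ** 0.5
scores = scores.masked_fill(attn_mask, -1e20)
attns = torch.softmax(scores, dim=-1)
# The suspect line: softmax saves its output tensor for backward,
# and this modifies that saved tensor in place.
attns.masked_fill_(attn_mask, 0.0)
context = torch.matmul(attns, V)

err = None
try:
    context.sum().backward()
except RuntimeError as e:
    err = e
print(err)  # the "modified by an inplace operation" RuntimeError
```

With the masked_fill_ line removed (or replaced by the out-of-place attns = attns.masked_fill(attn_mask, 0.0)), the backward pass runs cleanly in this sketch.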