Hi, everyone. I'm training a transformer for my research work. Here's the relevant code:
```python
import numpy as np
import torch

def forward(self, Q, K, V, d_k, attn_mask=None):
    # scaled dot-product attention scores: (batch_size, n_heads, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)
    if attn_mask is not None:
        assert scores.shape == attn_mask.shape
        # mask out disallowed positions before the softmax
        scores.masked_fill_(attn_mask, -1e20)
    attns = self.dropout(torch.softmax(scores, dim=-1))
    # (batch_size, n_heads, seq_len, seq_len) x (batch_size, n_heads, seq_len, d_v)
    # -> (batch_size, n_heads, seq_len, d_v)
    # if attn_mask is not None:
    #     attns.masked_fill_(attn_mask, 0.0)
    context = torch.matmul(attns, V)
    return context, attns
```
The above code works just fine. However, when I uncomment attns.masked_fill_(attn_mask, 0.0), an error occurs: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.
Filling the attention weights with 0.0 via attn_mask isn't strictly necessary, since the corresponding scores have already been masked before the softmax. But I'm wondering why adding such an operation causes this error. Does anyone have an idea? Thanks!
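For what it's worth, here is a toy snippet (not from my project; the tensors are made up) that raises the same RuntimeError in isolation. My guess is that softmax saves its output for the backward pass, and mutating that saved tensor in place with masked_fill_ trips autograd's version check, though I'm not sure this is exactly what happens inside my forward:

```python
import torch

x = torch.randn(4, requires_grad=True)
attn = torch.softmax(x, dim=-1)                   # softmax saves its output for backward
mask = torch.tensor([True, False, False, False])
attn.masked_fill_(mask, 0.0)                      # in-place write to a tensor autograd saved
attn.sum().backward()                             # RuntimeError: ... modified by an inplace operation
```

Replacing the in-place call with the out-of-place attn = attn.masked_fill(mask, 0.0) makes the toy snippet run cleanly, which is why I suspect the in-place op specifically.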
PS: There's too much code in this project, so I can't post all of it here or cut it down to a true minimal reproduction of the error. Sorry about that.