Hi, everyone. I’m training a transformer for my research work. Here’s the ScaledDotAttention
module:
def forward(self, Q, K, V, d_k, attn_mask=None):
    # scores: (batch_size, n_heads, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)
    if attn_mask is not None:
        assert scores.shape == attn_mask.shape
        scores.masked_fill_(attn_mask, -1e20)
    attns = self.dropout(torch.softmax(scores, dim=-1))
    # if attn_mask is not None:
    #     attns.masked_fill_(attn_mask, 0.0)
    # (..., seq_len, seq_len) x (..., seq_len, d_v) -> (batch_size, n_heads, seq_len, d_v)
    context = torch.matmul(attns, V)
    return context, attns
The above code works just fine. However, when I uncomment attns.masked_fill_(attn_mask, 0.0), an error occurs: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
Masking the attentions with 0.0 using attn_mask isn't necessary, since the corresponding scores have already been masked before the softmax. But I'm wondering why adding such an operation causes this error. Does anyone have an idea? Thanks!
ps: There's too much code in this project, so I can't post all of it here or create a minimal error-reproducing snippet. Sorry about that.
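The closest I can offer is a generic stand-alone sketch of the pattern I suspect (made-up shapes, plain tensors instead of my actual module): the in-place masked_fill_ on the softmax output seems to be what triggers the same error, since softmax saves its output for the backward pass.

```python
import torch

# Stand-alone sketch with made-up shapes (batch=2, seq_len=4, d_k=d_v=8),
# not my actual module.
torch.manual_seed(0)
Q = torch.randn(2, 4, 8, requires_grad=True)
K = torch.randn(2, 4, 8, requires_grad=True)
V = torch.randn(2, 4, 8, requires_grad=True)
attn_mask = torch.zeros(2, 4, 4, dtype=torch.bool)
attn_mask[:, :, -1] = True  # mask out the last key position

scores = torch.matmul(Q, K.transpose(-1, -2)) / 8 ** 0.5
scores = scores.masked_fill(attn_mask, -1e20)
attns = torch.softmax(scores, dim=-1)
# The suspect line: softmax saves its output tensor for backward,
# and this modifies that saved tensor in place.
attns.masked_fill_(attn_mask, 0.0)
context = torch.matmul(attns, V)

err = None
try:
    context.sum().backward()
except RuntimeError as e:
    err = e
print(err)  # the "modified by an inplace operation" RuntimeError
```

With the masked_fill_ line removed (or replaced by the out-of-place attns = attns.masked_fill(attn_mask, 0.0)), the backward pass runs cleanly in this sketch.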