I need to extract a Transformer's attention weights on the GPU, but my results sometimes differ slightly from PyTorch's built-in softmax.
Here is my code snippet:
import torch

# Dummy inputs standing in for my real queries/keys
# (presumably batch=16, heads=3, seq_len=4, d_head=5; n_fea_hid = 15 = 3 * 5 total hidden features)
queries = torch.randn(16, 3, 4, 5, device="cuda", requires_grad=True)
keys = torch.randn(16, 3, 4, 5, device="cuda", requires_grad=True)
n_fea_hid = 15

# Compute scaled dot-product attention scores
attention_scores = torch.matmul(queries, keys.transpose(-2, -1))
attention_scores.div_(n_fea_hid**0.5)

# PyTorch implementation
attention_weights1 = attention_scores.softmax(dim=3)

# My implementation: softmax with the usual max-subtraction trick for numerical stability
_attention_scores = attention_scores - attention_scores.max(3, keepdim=True).values
attention_scores_exp = _attention_scores.exp()
attention_weights2 = attention_scores_exp / attention_scores_exp.sum(dim=3, keepdim=True)
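Mathematically, the two paths should be identical: subtracting the per-row maximum m multiplies both the numerator and the denominator of the softmax by the same factor exp(-m), which cancels, so any difference has to be numerical. To quantify it instead of testing exact equality, I can look at the maximum absolute difference and a tolerance-based comparison; a minimal sketch (the atol of 1e-6 is my own choice, not anything from PyTorch's docs):

# Quantify the deviation between the two softmax paths
max_abs_diff = (attention_weights1 - attention_weights2).abs().max()
print(max_abs_diff)  # on the order of 1e-8 in my failing cases
# Tolerance-based comparison instead of exact ==; atol chosen well above one float32 ulp
print(torch.allclose(attention_weights1, attention_weights2, atol=1e-6))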
Most of the time, (attention_weights1 == attention_weights2).all() returns True,
but in rare cases it returns False.
When it returns False, not all values differ. For example,
attention_weights1[0,0] - attention_weights2[0,0]
returns
tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[1.4901e-08, 2.9802e-08, 2.9802e-08, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]], device='cuda:1',
grad_fn=<SubBackward0>)
but both (attention_weights1[0,1] == attention_weights2[0,1]).all() and
(attention_weights1[0,2] == attention_weights2[0,2]).all() return True.
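One thing I noticed: the nonzero differences above, 1.4901e-08 and 2.9802e-08, are exactly 2**-26 and 2**-25, i.e. one unit in the last place (ulp) for float32 values in [0.125, 0.25) and [0.25, 0.5) respectively. A quick check with torch.nextafter (the probe values 0.2 and 0.3 are just my representatives of those ranges):

# Spacing between adjacent float32 values at the magnitude of the attention weights
x = torch.tensor([0.2, 0.3], dtype=torch.float32)
ulp = torch.nextafter(x, torch.tensor(float("inf"))) - x
print(ulp)  # tensor([1.4901e-08, 2.9802e-08])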
Where do these minor deviations come from? I have googled a lot but still haven't found the reason.