My code does not return the same result as PyTorch's softmax

I need to extract a Transformer's attention weights on the GPU, but my results sometimes differ slightly from PyTorch's.

Here is my code snippet:

import torch

# queries.shape: torch.Size([16, 3, 4, 5])
# keys.shape: torch.Size([16, 3, 4, 5])
# n_fea_hid: 15

# Compute dot-product attention scores
attention_scores = torch.matmul(queries, keys.transpose(-2, -1))

# PyTorch implementation
attention_weights1 = attention_scores.softmax(dim=3)

# My implementation (softmax with max-subtraction for stability)
_attention_scores = attention_scores - attention_scores.max(3, keepdim=True).values
attention_scores_exp = _attention_scores.exp()
attention_weights2 = attention_scores_exp / attention_scores_exp.sum(
    dim=3, keepdim=True
)
Most of the time, (attention_weights1 == attention_weights2).all() returns True, but in rare cases it returns False.

When it returns False, not all of the values differ. For example, the element-wise absolute difference

(attention_weights1[0,0] - attention_weights2[0,0]).abs()


tensor([[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.4901e-08, 2.9802e-08, 2.9802e-08, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]], device='cuda:1')


Meanwhile, both

(attention_weights1[0,1] == attention_weights2[0,1]).all()

and

(attention_weights1[0,2] == attention_weights2[0,2]).all()

return True.

Where do these minor deviations come from? I have searched a lot but still haven't found the reason.

Hi Stu!

These are numerical round-off errors (and are to be expected).

While your two computations of attention_weights are mathematically
equivalent, they differ numerically. You can verify and explore this by
repeating the computations using torch.float64 (“double precision”)
and you will see the differences reduced by several orders of magnitude.
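A minimal sketch of that double-precision check (the shapes and seed here are illustrative, not taken from the original post):

```python
import torch

torch.manual_seed(0)
scores = torch.randn(16, 3, 4, 4)  # stand-in for the attention scores

def softmax_by_hand(s):
    # same max-subtraction softmax as in the question
    s = s - s.max(dim=-1, keepdim=True).values
    e = s.exp()
    return e / e.sum(dim=-1, keepdim=True)

# float32: differences on the order of machine epsilon (~1e-7)
diff32 = (scores.softmax(dim=-1) - softmax_by_hand(scores)).abs().max()

# float64: the same comparison, typically several orders of magnitude smaller
scores64 = scores.double()
diff64 = (scores64.softmax(dim=-1) - softmax_by_hand(scores64)).abs().max()

print(diff32.item(), diff64.item())
```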


K. Frank

As you mentioned, changing the precision to double() will reduce this difference by several orders of magnitude.

Another question: suppose I first compute those exponentials with torch on the GPU, then use numpy matrix multiplication to compute the softmax and the attention logits, and then make predictions (loading torch's Linear weights at the same float32 precision). Would these minor differences influence the resulting probability distribution?

Hi Stu!

Note, if you perform part of your computation with numpy instead of
pytorch, you won’t be able to use autograd to backpropagate through
that part of the computation (unless you write your own backward()
function to support that part of the backward pass).

(I don’t really understand what you’re asking here. Why would you do
things this way? Why not use pytorch for all of the forward pass?)
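To illustrate the autograd point with a toy sketch (the tensors here are made up, not the poster's model):

```python
import torch

x = torch.randn(4, requires_grad=True)
e = x.exp()                # still tracked by autograd
print(e.requires_grad)     # True

# A detour through numpy requires detaching first, and the result is
# no longer connected to the autograd graph:
e_np = e.detach().cpu().numpy()
w = torch.from_numpy(e_np / e_np.sum())
print(w.requires_grad)     # False - no gradient flows back to x through w
```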

You would expect “these minor differences” to have a minor effect on
“the probability distribution,” but you probably don’t care about such
minor differences. If they do turn out to be causing a problem (unlikely),
I would recommend performing that part of the computation (or perhaps
the whole computation) with pytorch in double precision (torch.float64).

Note that often when two mathematically-equivalent computations (that
are numerically different) produce results that differ by round-off error
near the level of machine precision, it is not the case that one result is
more correct than the other. They are both likely to be equally good
floating-point approximations to the exact mathematical result.
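A practical consequence: exact equality (==) is too strict a test for float32 results; a tolerance-based check such as torch.allclose is usually the right tool (a sketch with illustrative shapes):

```python
import torch

torch.manual_seed(0)
scores = torch.randn(16, 3, 4, 4)

builtin = scores.softmax(dim=-1)

shifted = scores - scores.max(dim=-1, keepdim=True).values
manual = shifted.exp() / shifted.exp().sum(dim=-1, keepdim=True)

# Element-wise equality may fail on a handful of entries, but the two
# results agree to well within float32 round-off:
print(torch.allclose(builtin, manual, atol=1e-6))  # True
```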


K. Frank