I implemented a backward function for a positional-embedding lookup, and I use `torch.allclose` to verify it against the autograd result. My implementation is very simple:
```python
# pos_matrix: [seq_len, seq_len]
# pos_weight: [max_pos_size, hidden_size]
# pos_embed:  [seq_len, seq_len, hidden_size]
grad_weight = torch.zeros((max_pos_size, hidden_size))
for i in range(seq_len):
    for j in range(seq_len):
        pos = pos_matrix[i, j]
        grad_weight[pos] += grad_embed[i, j]
```
The results are consistent under FP32 (difference < 1e-5), but there is a large difference under FP16 (max absolute difference: 0.0625), and the difference grows as the sequence length increases.
What could cause this difference? A different accumulation order in the CUDA code, FP16 overflow/rounding error, or a bug in my implementation?
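To give a sense of the scale of rounding error I'm wondering about, here is a small self-contained sketch (the sizes and value magnitudes are made up for illustration, not taken from my actual model) comparing a naive sequential FP16 running sum, which mimics the repeated `+=` into `grad_weight`, against an FP32 sum:

```python
import torch

torch.manual_seed(0)

# Hypothetical gradient contributions; n and the 0.1 scale are assumptions
# chosen only to make the FP16 rounding drift visible.
n = 4096
vals = torch.rand(n) * 0.1

# Naive sequential accumulation in FP16, like repeated += into one
# grad_weight row: each addition rounds to the nearest FP16 value.
acc16 = torch.zeros((), dtype=torch.float16)
for v in vals:
    acc16 = acc16 + v.half()

# FP64 reference sum, and the same reduction carried out in FP32.
ref = vals.double().sum().item()
err16 = abs(acc16.double().item() - ref)
err32 = abs(vals.float().sum().double().item() - ref)
print(f"fp16 accumulation error: {err16:.4f}, fp32 error: {err32:.2e}")
```

On my understanding, once the FP16 accumulator grows large, each small increment is rounded to a coarse ULP (e.g. 0.125 above 128), so the error compounds with the number of additions, which would match the difference growing with sequence length.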