My hand-written backward is inconsistent with autograd under fp16

I implemented a backward function for a positional-embedding lookup, and I use torch.allclose to verify it against the autograd result. My implementation is very simple:

import torch

# pos_matrix: [seq_len, seq_len]
# pos_weight: [max_pos_size, hidden_size]
# pos_embed:  [seq_len, seq_len, hidden_size]
# grad_embed: [seq_len, seq_len, hidden_size], upstream gradient w.r.t. pos_embed
grad_weight = torch.zeros((max_pos_size, hidden_size),
                          dtype=grad_embed.dtype, device=grad_embed.device)
for i in range(seq_len):
    for j in range(seq_len):
        pos = pos_matrix[i, j]
        # scatter-add the upstream gradient into the row for this position index
        grad_weight[pos] += grad_embed[i, j]
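
For completeness, this is roughly how I check it against autograd. It is a minimal sketch with made-up shapes; the F.embedding lookup and the random tensors are just stand-ins for my real model, and it assumes FP16 is supported on the chosen device:

import torch
import torch.nn.functional as F

# Hypothetical sizes and device, only for illustration.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16
seq_len, max_pos_size, hidden_size = 64, 128, 32

pos_matrix = torch.randint(0, max_pos_size, (seq_len, seq_len), device=device)
pos_weight = torch.randn(max_pos_size, hidden_size, device=device,
                         dtype=dtype, requires_grad=True)
grad_embed = torch.randn(seq_len, seq_len, hidden_size, device=device, dtype=dtype)

# Autograd reference: embedding lookup, then backprop the upstream gradient.
pos_embed = F.embedding(pos_matrix, pos_weight)   # [seq_len, seq_len, hidden_size]
pos_embed.backward(grad_embed)
ref_grad = pos_weight.grad

# Hand-written backward, same loop as above.
grad_weight = torch.zeros(max_pos_size, hidden_size, device=device, dtype=dtype)
for i in range(seq_len):
    for j in range(seq_len):
        grad_weight[pos_matrix[i, j]] += grad_embed[i, j]

print(torch.allclose(grad_weight, ref_grad))
print((grad_weight - ref_grad).abs().max())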

The results are consistent under FP32 (difference < 1e-5), but there is a large difference under FP16 (max absolute difference: 0.0625), and the difference grows as the sequence length increases.

What could cause this difference? A different accumulation order in the CUDA kernel? FP16 overflow/rounding error? Or is there a bug in my implementation?

A larger error with float16 is expected. For your particular use case, you could compute the errors in FP16, FP32, and FP64 and check whether the relative (and absolute) errors decrease within the expected ranges.
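
Something along these lines could serve as that check; the shapes, the random data, and the accumulate helper below are made up for illustration and are not taken from your code:

import torch

torch.manual_seed(0)
# Hypothetical sizes, only for the comparison.
seq_len, max_pos_size, hidden_size = 128, 64, 32

pos_matrix = torch.randint(0, max_pos_size, (seq_len, seq_len))
grad_embed = torch.randn(seq_len, seq_len, hidden_size, dtype=torch.float64)

def accumulate(dtype):
    # Same scatter-add as the hand-written backward, just in the given precision.
    out = torch.zeros(max_pos_size, hidden_size, dtype=dtype)
    g = grad_embed.to(dtype)
    for i in range(seq_len):
        for j in range(seq_len):
            out[pos_matrix[i, j]] += g[i, j]
    return out

ref = accumulate(torch.float64)  # FP64 result as the reference
for dtype in (torch.float16, torch.float32, torch.float64):
    res = accumulate(dtype).double()
    abs_err = (res - ref).abs().max().item()
    rel_err = ((res - ref).norm() / ref.norm()).item()
    print(f"{dtype}: max abs err {abs_err:.3e}, rel err {rel_err:.3e}")

If the FP32 and FP64 errors are orders of magnitude smaller while only the FP16 error is large (and grows with the number of accumulated values), the mismatch comes from FP16 rounding and accumulation order rather than from a bug in your loop.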