I implemented a backward function for a positional-embedding lookup, and I use `torch.allclose` to verify it against the autograd result. My implementation is very simple:
```python
# pos_matrix: [seq_len, seq_len]
# pos_weight: [max_pos_size, hidden_size]
# pos_embed:  [seq_len, seq_len, hidden_size]
grad_weight = torch.zeros((max_pos_size, hidden_size))
for i in range(seq_len):
    for j in range(seq_len):
        pos = pos_matrix[i, j]
        grad_weight[pos] += grad_embed[i, j]
```
The results are consistent under FP32 (difference < 1e-5), but there is a large difference under FP16 (max absolute difference: 0.0625), and the difference grows as the sequence length increases.
What could cause this difference? A different accumulation order in the CUDA code, FP16 overflow/rounding error, or a bug in my implementation?
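To give a sense of the scale of rounding error I'm wondering about, here is a small self-contained sketch (the sizes and value magnitudes are made up for illustration, not taken from my actual model) comparing a naive sequential FP16 running sum, which mimics the repeated `+=` into `grad_weight`, against an FP32 sum:

```python
import torch

torch.manual_seed(0)

# Hypothetical gradient contributions; n and the 0.1 scale are assumptions
# chosen only to make the FP16 rounding drift visible.
n = 4096
vals = torch.rand(n) * 0.1

# Naive sequential accumulation in FP16, like repeated += into one
# grad_weight row: each addition rounds to the nearest FP16 value.
acc16 = torch.zeros((), dtype=torch.float16)
for v in vals:
    acc16 = acc16 + v.half()

# FP64 reference sum, and the same reduction carried out in FP32.
ref = vals.double().sum().item()
err16 = abs(acc16.double().item() - ref)
err32 = abs(vals.float().sum().double().item() - ref)
print(f"fp16 accumulation error: {err16:.4f}, fp32 error: {err32:.2e}")
```

On my understanding, once the FP16 accumulator grows large, each small increment is rounded to a coarse ULP (e.g. 0.125 above 128), so the error compounds with the number of additions, which would match the difference growing with sequence length.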