I implemented a backward function for a positional-embedding lookup, and I use `torch.allclose` to verify it against the autograd result. My implementation is very simple:
```python
# pos_matrix: [seq_len, seq_len]
# pos_weight: [max_pos_size, hidden_size]
# pos_embed:  [seq_len, seq_len, hidden_size]
grad_weight = torch.zeros((max_pos_size, hidden_size))
for i in range(seq_len):
    for j in range(seq_len):
        pos = pos_matrix[i, j]
        grad_weight[pos] += grad_embed[i, j]
```
The results are consistent under FP32 (difference < 1e-5), but there is a large difference under FP16 (max absolute difference: 0.0625), and the difference grows as the sequence length increases.
What could cause this difference? A different accumulation order in the CUDA code, FP16 overflow/rounding error, or a bug in my implementation?
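To give a sense of the scale of rounding error I'm wondering about, here is a small self-contained sketch (the sizes and value magnitudes are made up for illustration, not taken from my actual model) comparing a naive sequential FP16 running sum, which mimics the repeated `+=` into `grad_weight`, against an FP32 sum:

```python
import torch

torch.manual_seed(0)

# Hypothetical gradient contributions; n and the 0.1 scale are assumptions
# chosen only to make the FP16 rounding drift visible.
n = 4096
vals = torch.rand(n) * 0.1

# Naive sequential accumulation in FP16, like repeated += into one
# grad_weight row: each addition rounds to the nearest FP16 value.
acc16 = torch.zeros((), dtype=torch.float16)
for v in vals:
    acc16 = acc16 + v.half()

# FP64 reference sum, and the same reduction carried out in FP32.
ref = vals.double().sum().item()
err16 = abs(acc16.double().item() - ref)
err32 = abs(vals.float().sum().double().item() - ref)
print(f"fp16 accumulation error: {err16:.4f}, fp32 error: {err32:.2e}")
```

On my understanding, once the FP16 accumulator grows large, each small increment is rounded to a coarse ULP (e.g. 0.125 above 128), so the error compounds with the number of additions, which would match the difference growing with sequence length.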