Hi,
I found that `scaled_dot_product_attention` can return `nan` in its backward pass after a few training iterations. After switching to a self-implemented attention, everything works fine. I suspect there is some overflow or underflow inside `scaled_dot_product_attention`.
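For context, by "self-implemented attention" I mean the standard manual formulation below; this is a minimal sketch (the actual module in my model may differ, so treat the shapes and the `manual_attention` name as illustrative):

```python
import math
import torch

def manual_attention(q, k, v, attn_mask=None):
    # Same math as F.scaled_dot_product_attention:
    # softmax(q @ k^T / sqrt(d_head)) @ v
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (..., L, S)
    if attn_mask is not None:
        scores = scores + attn_mask  # additive mask; -inf blocks a position
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```

With this used in place of the `F.scaled_dot_product_attention(q, k, v)` call, the gradients stay finite for me.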
I am using a 40 GB A100.