NaN in the backward pass of scaled_dot_product_attention

Hi,

I found that scaled_dot_product_attention can return NaN in its backward pass after a few iterations. After switching to my own self-attention implementation, everything works fine. My guess is that there is some overflow or underflow inside scaled_dot_product_attention.
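
For reference, here is a minimal sketch of the kind of hand-written attention I mean, compared against the built-in function. The name `manual_attention`, the tensor shapes, and the float32 softmax are assumptions for illustration, not the exact code from my model, and on well-behaved random inputs like these both versions give finite gradients.

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q, k, v):
    """Plain softmax(QK^T / sqrt(d)) V; q, k, v are (batch, heads, seq, head_dim)."""
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    # Assumption: run the softmax in float32 to reduce the risk of
    # overflow/underflow when the inputs are half precision.
    weights = torch.softmax(scores.float(), dim=-1).to(q.dtype)
    return torch.matmul(weights, v)


torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8, 16, requires_grad=True) for _ in range(3))

out_manual = manual_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)

# Forward outputs should agree up to floating-point tolerance.
forward_matches = torch.allclose(out_manual, out_fused, atol=1e-4)

# Compare gradient health of both implementations.
grads_manual = torch.autograd.grad(out_manual.sum(), (q, k, v))
grads_fused = torch.autograd.grad(out_fused.sum(), (q, k, v))
grads_finite = all(
    torch.isfinite(g).all().item() for g in grads_manual + grads_fused
)
print(forward_matches, grads_finite)
```

Running the same comparison on the actual tensors from a failing iteration (same dtype and device) should show whether only the built-in path produces non-finite gradients.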

I am using a 40 GB A100.

Could you please elaborate on how you solved this?

Hi, I have not solved this issue. My guess is that the backward pass of scaled_dot_product_attention may currently be numerically unstable, but I am not sure.
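
In case it helps anyone narrow this down, below is a rough sketch of how one could check whether the NaN really originates in the attention backward. It uses `torch.autograd.detect_anomaly`, which makes backward raise an error at the function that produced a NaN, plus a small helper, `nonfinite_grad_names`, which is a hypothetical name I made up for this example. The toy inputs here are an assumption and do not reproduce the problem.

```python
import torch
import torch.nn.functional as F


def nonfinite_grad_names(named_tensors):
    """Return the names of tensors whose .grad contains NaN or Inf."""
    return [
        name
        for name, t in named_tensors
        if t.grad is not None and not torch.isfinite(t.grad).all()
    ]


torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 4, 8, requires_grad=True) for _ in range(3))

# With anomaly detection on, backward raises a RuntimeError pointing at
# the backward function that returned NaN (slow; for debugging only).
with torch.autograd.detect_anomaly():
    out = F.scaled_dot_product_attention(q, k, v)
    out.sum().backward()

bad = nonfinite_grad_names([("q", q), ("k", k), ("v", v)])
print(bad)
```

In a real training loop, the same helper could be called on `model.named_parameters()` right after `loss.backward()` to see which parameters first receive non-finite gradients.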