Hi,
I found that `scaled_dot_product_attention` can return `nan` in its backward pass after a few training iterations. After switching to a self-implemented attention, everything works fine. I suspect there is some overflow or underflow inside `scaled_dot_product_attention`.
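For context, by "self-implemented attention" I mean the standard manual formulation below; this is a minimal sketch (the actual module in my model may differ, so treat the shapes and the `manual_attention` name as illustrative):

```python
import math
import torch

def manual_attention(q, k, v, attn_mask=None):
    # Same math as F.scaled_dot_product_attention:
    # softmax(q @ k^T / sqrt(d_head)) @ v
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (..., L, S)
    if attn_mask is not None:
        scores = scores + attn_mask  # additive mask; -inf blocks a position
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```

With this used in place of the `F.scaled_dot_product_attention(q, k, v)` call, the gradients stay finite for me.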
I am using a 40 GB A100.