2nd derivative for attn is not working

Hi,

I have a Transformer-based energy-based model built with PyTorch's nn.TransformerEncoderLayer. Inside the loss definition I compute a derivative of the model's output, and then call loss.backward() to train the model; in other words, I need to take the 2nd derivative of the model.
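In case it helps, here is a minimal sketch of the kind of setup I mean (the layer sizes, the scalar head, and differentiating with respect to the input are just illustrative assumptions, not my actual model):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the energy model: one encoder layer plus a scalar head.
encoder = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
head = nn.Linear(64, 1)

def energy(x):
    # Scalar "energy", summed over the batch so autograd.grad gets a scalar output.
    return head(encoder(x)).sum()

x = torch.randn(8, 16, 64, requires_grad=True)

# 1st derivative, kept in the graph (create_graph=True) so the loss below
# can be backpropagated through it.
(grad_x,) = torch.autograd.grad(energy(x), x, create_graph=True)

loss = grad_x.pow(2).mean()  # illustrative loss built from the 1st derivative
loss.backward()              # this call needs the 2nd derivative and is where it fails
```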

With PyTorch 2.1.0, this fails with RuntimeError: derivative for aten::_scaled_dot_product_efficient_attention_backward is not implemented. The same code worked fine with PyTorch 1.13.0.
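My understanding (please correct me if I'm wrong) is that in 2.x the attention inside nn.TransformerEncoderLayer goes through the fused scaled_dot_product_attention kernels, and the memory-efficient kernel named in the error does not support double backward. For reference, these flags report which backends are enabled:

```python
import torch

# Report which scaled-dot-product-attention backends are currently enabled;
# the memory-efficient one is the kernel named in the error message.
print("flash:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:         ", torch.backends.cuda.math_sdp_enabled())
```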

Is there any solution for this with PyTorch 2.1.0? Thank you!