2nd derivative for attn is not working

Hi,

I have a Transformer-based energy-based model (built with PyTorch nn.TransformerEncoderLayer). I compute the derivative of the model inside the loss definition and then call loss.backward() to train it, so I effectively need the 2nd derivative of the model.

With PyTorch 2.1.0 this fails with RuntimeError: derivative for aten::_scaled_dot_product_efficient_attention_backward is not implemented. The same code worked with PyTorch 1.13.0.

Is there any solution for this with PyTorch 2.1.0? Thank you!
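For reference, here is a minimal sketch of the setup (model sizes, shapes, and the loss are placeholders, not my actual code). The inner torch.autograd.grad keeps the first derivative in the graph, so loss.backward() needs a second derivative of the attention op; the error shows up when the fused memory-efficient SDPA kernel is picked (typically on CUDA):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2).to(device)

x = torch.randn(8, 16, 32, device=device, requires_grad=True)  # (batch, seq, d_model)
energy = model(x).sum()

# First derivative of the energy w.r.t. the input, kept in the graph
# so it can appear in the loss.
grad_x, = torch.autograd.grad(energy, x, create_graph=True)

loss = grad_x.pow(2).mean()
loss.backward()  # double backward through attention fails here on 2.1.0
```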


I have exactly the same issue. Does anyone know how to solve this? Thanks a lot!

Writing the Transformer encoder with plain PyTorch operations (matmul, softmax, etc.) instead of nn.TransformerEncoderLayer bypasses this, since the fused scaled-dot-product attention kernel does not implement a double backward. A sketch is below.
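For example, a minimal sketch of multi-head self-attention written only with plain ops (linear, matmul, softmax), all of which autograd can differentiate twice; the class name and sizes are just illustrative:

```python
import math
import torch
import torch.nn as nn

class PlainSelfAttention(nn.Module):
    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        assert d_model % nhead == 0
        self.nhead = nhead
        self.d_head = d_model // nhead
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, nhead, seq, d_head)
        q = q.view(b, s, self.nhead, self.d_head).transpose(1, 2)
        k = k.view(b, s, self.nhead, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.nhead, self.d_head).transpose(1, 2)
        # plain scaled dot-product attention: every op here supports double backward
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.out(out)
```

Combine this with the usual feed-forward block, LayerNorms, and residual connections to get a drop-in replacement for nn.TransformerEncoderLayer that works with second derivatives.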