Gradient computation with PyTorch autograd using 1st- and 2nd-order derivatives does not work

I am running into a strange issue with PyTorch’s autograd when implementing a custom loss for a second-order differential equation. In the code below, the predictions of the neural network are checked against a second-order differential equation (the diffusion equation), and that part works fine. However, when I try to compute the gradient of the loss with respect to the predictions u, I get an error indicating that there is no connection between loss and u in the computational graph:

RuntimeError: One of the differentiated Tensors appears to not have
been used in the graph. Set allow_unused=True if this is the desired
behavior.

This does not make sense to me, because the loss is computed directly from derivatives that originate from u. Differentiating the loss with respect to u_xx and u_t works; differentiating with respect to u_x does NOT. We verified that .requires_grad is True for all of the involved tensors (X, u, u_d, u_x, u_t, u_xx).
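
Concretely (variable names as in the code below), this is roughly what we checked:

# All of these print True
for name, t in [("X", X), ("u", u), ("u_d", u_d),
                ("u_x", u_x), ("u_t", u_t), ("u_xx", u_xx)]:
    print(name, t.requires_grad)

# Differentiating the loss w.r.t. the tensors it uses directly works ...
dl_du_t, dl_du_xx = torch.autograd.grad(loss, (u_t, u_xx), retain_graph=True)

# ... but differentiating w.r.t. u_x (or u) raises the RuntimeError quoted above
dl_du_x = torch.autograd.grad(loss, u_x, retain_graph=True)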

Why does this happen, and how can we fix it?

Main code:

# Ensure X requires gradients
X.requires_grad_(True)

# Get model predictions
u = self.pinn(X)

# Compute first-order gradients (∂u/∂x and ∂u/∂t)
u_d = torch.autograd.grad(
    u,
    X,
    grad_outputs=torch.ones_like(u),
    retain_graph=True,
    create_graph=True,  # Allow higher-order differentiation
)[0]

# Extract derivatives
u_x, u_t = u_d[:, 0], u_d[:, 1]  # ∂u/∂x and ∂u/∂t

# Compute second-order derivative ∂²u/∂x²
u_xx = torch.autograd.grad(
    u_x,
    X,
    grad_outputs=torch.ones_like(u_x),
    retain_graph=True,
    create_graph=True,
)[0][:, 0]

# Diffusion equation (∂u/∂t = κ * ∂²u/∂x²)
loss = nn.functional.mse_loss(u_t, self.kappa * u_xx)

## THIS FAILS
# Compute ∂loss/∂u
loss_u = torch.autograd.grad(
    loss,
    u,
    grad_outputs=torch.ones_like(loss),
    retain_graph=True,
    create_graph=True,
)[0]

# Return error on diffusion equation
return loss

Model:

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Sequential                               [1, 1]                    --
├─Linear: 1-1                            [1, 50]                   150
├─Tanh: 1-2                              [1, 50]                   --
├─Linear: 1-3                            [1, 50]                   2,550
├─Tanh: 1-4                              [1, 50]                   --
├─Linear: 1-5                            [1, 50]                   2,550
├─Tanh: 1-6                              [1, 50]                   --
├─Linear: 1-7                            [1, 50]                   2,550
├─Tanh: 1-8                              [1, 50]                   --
├─Linear: 1-9                            [1, 1]                    51
==========================================================================================
Total params: 7,851
Trainable params: 7,851
Non-trainable params: 0
Total mult-adds (M): 0.01
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.03
Estimated Total Size (MB): 0.03
==========================================================================================
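
For reference, here is a minimal, self-contained sketch of the same computation. The small stand-in network, the random collocation points, and kappa = 0.1 are placeholders rather than our real setup, but the final grad call fails with the same RuntimeError:

import torch
import torch.nn as nn

# Stand-in PINN: 2 inputs (x, t) -> 1 output u
pinn = nn.Sequential(nn.Linear(2, 50), nn.Tanh(), nn.Linear(50, 1))
kappa = 0.1

# Random collocation points; columns are x and t
X = torch.rand(32, 2, requires_grad=True)
u = pinn(X)

# First-order derivatives ∂u/∂x and ∂u/∂t
u_d = torch.autograd.grad(u, X, grad_outputs=torch.ones_like(u),
                          retain_graph=True, create_graph=True)[0]
u_x, u_t = u_d[:, 0], u_d[:, 1]

# Second-order derivative ∂²u/∂x²
u_xx = torch.autograd.grad(u_x, X, grad_outputs=torch.ones_like(u_x),
                           retain_graph=True, create_graph=True)[0][:, 0]

# Residual of the diffusion equation ∂u/∂t = κ * ∂²u/∂x²
loss = nn.functional.mse_loss(u_t, kappa * u_xx)

# Raises: "One of the differentiated Tensors appears to not have been
# used in the graph. Set allow_unused=True ..."
loss_u = torch.autograd.grad(loss, u, retain_graph=True, create_graph=True)[0]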

What we have already tried:

Reverting to an older PyTorch version (tested on 2.5.0 and 1.13.1). Same issue.

Calling .requires_grad_(True) after every variable assignment. This did not help.

Replacing the tensor slicing with a multiplication by zero/one masks, without success. We thought the slicing might disturb the computational graph and break the connection to u.

# Extract derivatives
u_x, u_t = u_d[:, 0], u_d[:, 1]  # ∂u/∂x and ∂u/∂t

# Extract derivatives alternative
u_x = torch.sum(
    torch.reshape(torch.tensor([1, 0], device=u_d.device), [1, -1]) * u_d,
    dim=1,
    keepdim=True,
)
u_t = torch.sum(
    torch.reshape(torch.tensor([0, 1], device=u_d.device), [1, -1]) * u_d,
    dim=1,
    keepdim=True,
)

Thanks for your help!