import torch


def zero_grad(t: torch.Tensor) -> torch.Tensor:
    # Note: zero_grad was not shown in the original post; this is a plausible
    # stand-in that returns a fresh leaf tensor so stale gradients are dropped.
    return t.detach().clone().requires_grad_(True)


def maml_simulation() -> None:
    x_inner = torch.tensor(2.0, requires_grad=True)
    x_outer = torch.tensor(3.0, requires_grad=True)
    theta_outer = torch.randn(1, requires_grad=True)
    print(f"before training: x inner: {x_inner} theta outer: {theta_outer}")
    loss_func = lambda x: (x - 10) ** 2
    loss = torch.tensor(0.0, requires_grad=True)
    for i in range(5):
        theta_outer = zero_grad(theta_outer)
        print(f"outer loop, theta before: {theta_outer}")
        theta_inner = torch.tensor(theta_outer.item(), requires_grad=True)
        for j in range(5):
            theta_inner = zero_grad(theta_inner)
            prediction = theta_inner * x_inner
            inner_loss = loss_func(prediction)
            grad = torch.autograd.grad(inner_loss, theta_inner)[0]
            with torch.no_grad():
                theta_inner = theta_inner - 0.01 * grad
            print(f"inner prediction: {prediction}, loss: {inner_loss}, grad: {grad}")
        prediction = theta_outer * x_outer
        loss = loss_func(prediction)
        print(loss.requires_grad, theta_outer.requires_grad)
        grad = torch.autograd.grad(loss, theta_outer)[0]
        print("outer grad: ", grad)
        with torch.no_grad():
            theta_outer = theta_outer - 0.01 * grad
        print(f"\n\nouter prediction: {prediction}, loss: {loss}, grad: {grad}")
        print(f"theta after: {theta_outer}\n\n")

I am going through some meta-learning material, and I want to follow the second derivatives of this loop to see what they look like and whether the code is doing what I think it is doing.

I simplified the above code into something more concise that shows what I am trying to do, and also shows that it is not happening in PyTorch.

By my hand calculation, the second derivative at the bottom print statement should be -12.xx, but I am getting the first-order derivative instead of the second, even though I have set create_graph=True. Am I doing something wrong here?
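For reference, here is a minimal standalone check of what create_graph=True is supposed to enable (a generic example, not the snippet from the question): the first call to autograd.grad keeps the graph alive so the gradient itself can be differentiated again.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First derivative: create_graph=True keeps the graph of the gradient,
# so we can differentiate through it a second time.
dy = torch.autograd.grad(y, x, create_graph=True)[0]   # 3 * x**2 = 27
# Second derivative, taken through the first one.
d2y = torch.autograd.grad(dy, x)[0]                    # 6 * x = 18

print(dy.item(), d2y.item())
```

Without create_graph=True, the first grad call returns a plain tensor that is detached from the graph, and the second call fails rather than returning a second derivative.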

Thanks, I’ll look into that package to see if it helps. Do you see any problem with my second derivative above? Flipping the create_graph boolean doesn’t change the second gradient at all, even though I would expect it to.

I’m not sure what the purpose of theta_two is in your code above; why not use theta directly? (It turns out, after more investigation below, that this was the root of the problem; see the rest of the answer.)

Also if I read correctly, loss_one = (theta * xi - 10)**2.
So grad = 2 * xi * (theta * xi - 10).
So the new theta_two = theta - 0.01 * (2 * xi * (theta * xi - 10)) = theta - 0.02 * theta * xi**2 + 0.2 * xi = theta * (1 - 0.02 * xi**2) + 0.2 * xi.
And loss = ((theta * (1 - 0.02 * xi**2) + 0.2 * xi) * xj - 10)**2 = 1.0816
And its derivative grad = 2 * xj * (1 - 0.02 * xi**2) * ((theta * (1 - 0.02 * xi**2) + 0.2 * xi) * xj - 10)
So the final grad should be -6.822399999999995.
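This hand calculation can be checked numerically. The concrete inputs are not shown above, but theta = 2, xi = 3, xj = 4 reproduce both 1.0816 and -6.8224, so this sketch assumes those values:

```python
import torch

# Values assumed for illustration (inferred from 1.0816 and -6.8224 above).
theta = torch.tensor(2.0, requires_grad=True)
xi, xj = 3.0, 4.0

loss_one = (theta * xi - 10) ** 2
# create_graph=True keeps the inner gradient differentiable.
g = torch.autograd.grad(loss_one, theta, create_graph=True)[0]
theta_two = theta - 0.01 * g  # out-of-place update keeps theta in the graph

loss = (theta_two * xj - 10) ** 2
grad = torch.autograd.grad(loss, theta)[0]  # gradient wrt theta, not theta_two
print(loss.item(), grad.item())
```

Asking for the gradient with respect to theta (the pre-update parameter) is what forces autograd to differentiate through the inner update, which is where the second-derivative term comes from.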

Why do you only see the part that corresponds to the last loss computation, and why does your code behave the same for create_graph=True and create_graph=False?
Because here: grad = torch.autograd.grad(loss, theta_two)[0] you ask for gradients wrt theta_two. But theta_two is the result of theta_two -= 0.01 * grad, so you get gradients wrt the result of that operation.
If you want gradients wrt theta, you should use grad = torch.autograd.grad(loss, theta)[0]. Then you will see that the original value of theta_two is needed for the double backward, and you will need to change to theta_two = theta_two - 0.01 * grad.

I see now. I wasn’t able to get the graph working, but I got the code snippet working. Below is the final version with the right derivatives, for anyone who finds this later. One thing I don’t get, though, is why theta -= 0.01 * grad behaves differently than theta = theta - 0.01 * grad. I thought the first one was just shorthand for exactly the same thing. Why did that need to change?

The first one modifies the Tensor pointed to by theta in place, so this becomes the new value of that Tensor.

theta = theta - 0.01 * grad creates a new Tensor and associates it with the name “theta”. The Tensor that was originally pointed to by theta is unchanged.

Your final code works because you do theta_two = theta - 0.01 * grad, and so you keep a reference to the old theta to be able to pass it as an input to the autograd.grad call.
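A small sketch of the difference (the in-place update is wrapped in no_grad because PyTorch forbids in-place operations on a leaf tensor that requires grad):

```python
import torch

theta = torch.tensor(2.0, requires_grad=True)
orig = theta

with torch.no_grad():
    theta -= 0.5          # in-place: same Tensor object, value overwritten
assert theta is orig and orig.item() == 1.5

theta2 = theta - 0.5      # out-of-place: new Tensor, old one left untouched
assert theta2 is not theta and theta.item() == 1.5 and theta2.item() == 1.0
```

The out-of-place form is the one that lets you keep both the pre-update and post-update parameters alive, which is exactly what the double backward needs.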