How to print gradient graph

import torch

# zero_grad is assumed to clear any accumulated gradient and hand the tensor back;
# the helper was not shown in the original post.
def zero_grad(t: torch.Tensor) -> torch.Tensor:
    if t.grad is not None:
        t.grad = None
    return t

def maml_simulation() -> None:
    x_inner = torch.tensor(2.0, requires_grad=True)
    x_outer = torch.tensor(3.0, requires_grad=True)
    theta_outer = torch.randn(1, requires_grad=True)
    print(f"before training: x inner: {x_inner} theta outer: {theta_outer}")

    loss_func = lambda x: (x - 10) ** 2

    loss = torch.tensor(0.0, requires_grad=True)
    for i in range(5):
        theta_outer = zero_grad(theta_outer)

        print(f"outer loop, theta before: {theta_outer}")
        theta_inner = torch.tensor(theta_outer.item(), requires_grad=True)
        for j in range(5):
            theta_inner = zero_grad(theta_inner)

            prediction = theta_inner * x_inner
            inner_loss = loss_func(prediction)

            grad = torch.autograd.grad(inner_loss, theta_inner)[0]
            with torch.no_grad():
                theta_inner = theta_inner - 0.01 * grad
            print(f"inner prediction: {prediction}, loss: {inner_loss}, grad: {grad}")

        prediction = theta_outer * x_outer
        loss = loss_func(prediction)
        print(loss.requires_grad, theta_outer.requires_grad)
        grad = torch.autograd.grad(loss, theta_outer)[0]
        print("outer grad: ", grad)

        with torch.no_grad():
            theta_outer = theta_outer - 0.01 * grad
        print(f"\n\nouter prediction: {prediction}, loss: {loss}, grad: {grad}")
        print(f"theta after: {theta_outer}\n\n")

I am going through some meta-learning material and I want to follow the second derivatives through this loop, to see what the computation looks like and whether it is doing what I think it is doing.

  1. Does this code calculate a second derivative, as stated in this paper? https://arxiv.org/abs/1703.03400

  2. How can I print out the graph or verify somehow that it is doing what I think it is doing?

Thanks

I simplified the above code into something more concise that shows what I am trying to do, and also shows that it is not happening in PyTorch.

By my hand calculation, the second derivative at the bottom print statement should be -12.xx, but I am getting the first-order derivative instead of the second, even though I have set create_graph=True. Am I doing something wrong here?

import torch

def maml_simulation() -> None:
    x_i = torch.tensor(3.0, requires_grad=True)
    x_j = torch.tensor(4.0, requires_grad=True)
    theta = torch.tensor(2.0, requires_grad=True)
    theta_two = theta.clone()
    loss_func = lambda x: (x - 10) ** 2
    print(f"before training: theta: {theta}")

    prediction = theta_two * x_i
    loss_one = loss_func(prediction)

    grad = torch.autograd.grad(loss_one, theta_two, create_graph=True)[0]
    theta_two -= 0.01 * grad
    print(
        f"first prediction: {prediction}, loss: {loss_one}, grad: {grad}, theta after update: {theta_two}"
    )

    prediction = theta_two * x_j
    loss = loss_func(prediction)
    grad = torch.autograd.grad(loss, theta_two)[0]
    print(f"second prediction: {prediction}, loss: {loss}, second grad: {grad}")

Hi,

You can use the torchviz package to print the graph corresponding to gradient computations.
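If you just want a textual dump without installing anything extra, a minimal sketch (my own helper, not part of any library) that walks the `.grad_fn` chain also works:

```python
import torch

# Walk the autograd graph by following .grad_fn and next_functions,
# printing one node type per line, indented by depth.
def print_graph(fn, depth=0):
    if fn is None:
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        print_graph(next_fn, depth + 1)

theta = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
loss = (theta * x - 10) ** 2
print_graph(loss.grad_fn)
# Prints something like:
# PowBackward0
#   SubBackward0
#     MulBackward0
#       AccumulateGrad
```

The AccumulateGrad leaf at the bottom is where gradients for theta land, so you can check which leaves a loss is actually connected to.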

Thanks, I’ll look into that package to see if it helps. Do you see any problem with my second derivative above? Flipping the boolean create_graph doesn’t change the second gradient at all, which is not what I would expect.

I’m not sure what the purpose of theta_two is in your code above; why not use theta directly? (It turns out, after more investigation below, that this was the root of the problem; see the rest of the answer.)

Also if I read correctly, loss_one = (theta * xi - 10)**2.
So grad = 2 * xi * (theta * xi - 10).
So the new theta_two = theta - 0.01 * (2 * xi * (theta * xi - 10)) = theta - 0.02 * theta * xi**2 + 0.2 * xi = theta * (1 - 0.02 * xi**2) + 0.2 * xi.
And loss = ((theta * (1 - 0.02 * xi**2) + 0.2 * xi) * xj - 10)**2 = 1.0816
And its derivative grad = 2 * xj * (1 - 0.02 * xi**2) * ((theta * (1 - 0.02 * xi**2) + 0.2 * xi) * xj - 10)
So the final grad should be -6.822399999999995.
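The algebra above can be checked numerically with autograd itself. This is a standalone sketch of the two-step computation, keeping the graph alive through the inner update with create_graph=True:

```python
import torch

x_i = torch.tensor(3.0)
x_j = torch.tensor(4.0)
theta = torch.tensor(2.0, requires_grad=True)
loss_func = lambda x: (x - 10) ** 2

# Inner step: grad of the first loss wrt theta, kept differentiable.
inner_grad = torch.autograd.grad(loss_func(theta * x_i), theta, create_graph=True)[0]
theta_two = theta - 0.01 * inner_grad  # out-of-place update keeps the graph back to theta

# Outer step: differentiate the second loss all the way back to theta.
outer_grad = torch.autograd.grad(loss_func(theta_two * x_j), theta)[0]
print(outer_grad)  # ≈ tensor(-6.8224), matching the hand calculation
```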

Why do you only see the part that corresponds to the last loss computation, and why does your code behave the same for create_graph=True and create_graph=False?
Because here: grad = torch.autograd.grad(loss, theta_two)[0] you ask for gradients wrt theta_two. But theta_two is the result of theta_two -= 0.01 * grad, so you get gradients wrt the result of that operation.
If you want gradients wrt theta, you should use grad = torch.autograd.grad(loss, theta)[0]. Then you will see that the original value of theta_two is needed for the double backward, and you will need to change the update to theta_two = theta_two - 0.01 * grad.
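To illustrate the point: if you start from the already-updated value and differentiate wrt that tensor, only the last loss computation is differentiated, and the inner-step factor disappears. A small sketch (the 2.24 here is just the post-update value from the example above):

```python
import torch

x_j = torch.tensor(4.0)
theta_two = torch.tensor(2.24, requires_grad=True)  # value *after* the inner update
loss = (theta_two * x_j - 10) ** 2
g = torch.autograd.grad(loss, theta_two)[0]
print(g)  # 2 * x_j * (theta_two * x_j - 10) ≈ tensor(-8.3200), not -6.8224
```

The missing (1 - 0.02 * xi**2) factor is exactly the contribution of differentiating through the inner update, which only appears when you differentiate wrt the original theta.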

Hope this helps.

I see now. I wasn’t able to get the graph working, but I got the code snippet working. Below is the final version with the right derivatives, for anyone who finds this later. One thing I don’t get, though: why does
theta -= 0.01 * grad behave differently from theta = theta - 0.01 * grad? I thought the first was just shorthand for the second and exactly the same. Why did that need to change?

import torch

def maml_simulation() -> None:
    x_i = torch.tensor(3.0, requires_grad=True)
    x_j = torch.tensor(4.0, requires_grad=True)
    theta = torch.tensor(2.0, requires_grad=True)
    loss_func = lambda x: (x - 10) ** 2
    print(f"before training: theta: {theta}")

    prediction = theta * x_i
    loss_one = loss_func(prediction)

    # create_graph=True keeps the inner gradient differentiable for the second derivative
    grad = torch.autograd.grad(loss_one, theta, create_graph=True)[0]
    theta_two = theta - 0.01 * grad
    print(
        f"first prediction: {prediction}, loss: {loss_one}, grad: {grad}, theta after update: {theta_two}"
    )

    prediction = theta_two * x_j
    loss = loss_func(prediction)
    grad = torch.autograd.grad(loss, theta)[0]
    print(f"second prediction: {prediction}, loss: {loss}, second grad: {grad}")

They are different:

  • The first one modifies the Tensor pointed to by theta in place, so that Tensor now holds the new value.
  • theta = theta - 0.01 * grad creates a new Tensor and associates it with the name “theta”. The Tensor that was originally pointed to by theta is unchanged.
  • Your final code works because you do theta_two = theta - 0.01 * grad, and so you keep a reference to the old theta to be able to pass it as input to the autograd.grad call.
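The difference can be seen directly by checking object identity before and after each style of update (a minimal sketch on plain tensors, no autograd involved):

```python
import torch

# In-place: "t -= u" mutates the existing tensor object.
a = torch.tensor([1.0, 2.0])
before = id(a)
a -= 0.5
assert id(a) == before  # same object, new values

# Out-of-place: "t = t - u" rebinds the name to a brand-new tensor.
b = torch.tensor([1.0, 2.0])
before = id(b)
b = b - 0.5
assert id(b) != before  # fresh object; the original is unchanged
```

That is why the out-of-place form lets you keep a handle on the original tensor and use it as the target of a later autograd.grad call.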