Question about the Extending PyTorch tutorial

Hi, I’m reading the “Extending PyTorch” tutorial and have the following questions:

  1. In the document, it says:

Tensor arguments that track history (i.e., with requires_grad=True) will be converted to ones that don’t track history before the call, and their use will be registered in the graph.

Does this happen implicitly, or do I need to do this manually? In the examples below, I don’t see any line that disables gradient tracking when parameters are passed into forward(). However, further down, it says:

Inputs to backward, i.e., grad_output, can also be tensors that track history. So if backward is implemented with differentiable operations, (e.g., invocation of another custom Function), higher-order derivatives will work.

This is confusing: if I need to disable tracking manually, why isn’t it done in the example? And if it is done implicitly, how can I enable tracking for a tensor?

  2. There’s example code below whose intention I don’t fully understand:
class MyCube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # We wish to save dx for backward. In order to do so, it must
        # be returned as an output.
        dx = 3 * x ** 2
        result = x ** 3
        return result, dx

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, = inputs
        result, dx = output
        ctx.save_for_backward(x, dx)

    @staticmethod
    def backward(ctx, grad_output, grad_dx):
        x, dx = ctx.saved_tensors
        # In order for the autograd.Function to work with higher-order
        # gradients, we must add the gradient contribution of `dx`,
        # which is grad_dx * 6 * x.
        result = grad_output * dx + grad_dx * 6 * x
        return result

# Wrap MyCube in a function so that it is clearer what the output is
def my_cube(x):
    result, dx = MyCube.apply(x)
    return result

Why do we need to return the sum of the first and second-order derivatives? What does this mean?

  3. It seems the documentation isn’t written in a very detailed or easy-to-understand way (or maybe I missed something; if so, please explain and forgive me). Even though I’ve referred to many other sources, I still feel there are many components that appear in the examples without being explained (or are explained as if they were obvious). For example, in this line, why is needs_input_grad used even though it hasn’t been initialized? Is it automatically assigned a value based on something else?
    Which additional sources should I read to better understand these points?

Thanks

Does this happen implicitly, or do I need to do this manually?

The disabling of gradient tracking is done implicitly for you within the scope of forward.
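
For example, here is a minimal sketch (using the same setup_context style as the tutorial, so it assumes a recent PyTorch; Square is made up for illustration) that just prints the flag inside forward:

import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # The caller passes a tensor with requires_grad=True, but autograd
        # hands forward a version that does not track history, so this
        # prints False.
        print("inside forward: x.requires_grad =", x.requires_grad)
        return x ** 2

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * 2 * x

x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)   # prints: inside forward: x.requires_grad = False
y.backward()
print(x.grad)         # tensor(6.)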

Inputs to backward, i.e., grad_output, can also be tensors that track history.

The first quote refers to forward; this one refers to the backward pass. There, gradient tracking can be enabled if you are trying to compute higher-order gradients, e.g., by calling torch.autograd.grad or backward with create_graph=True.
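
As a rough sketch of what that looks like in practice (reusing the MyCube and my_cube definitions from your post, so it assumes they are already defined): passing create_graph=True records the operations done inside backward, which is what makes the second grad call possible.

import torch

x = torch.tensor(2.0, requires_grad=True)
y = my_cube(x)                                           # x ** 3

# grad_output inside MyCube.backward tracks history here, because we asked
# for a differentiable graph of the backward pass.
first, = torch.autograd.grad(y, x, create_graph=True)    # 3 * x ** 2 -> 12
second, = torch.autograd.grad(first, x)                  # 6 * x      -> 12
print(first, second)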

You can think of dx as just another output of that function. If someone ended up using dx in the computation of the final loss, the backward function should properly backprop that gradient back to x: grad_dx is the gradient of the loss with respect to dx, and since dx = 3 * x ** 2, its contribution to the gradient of x is grad_dx * 6 * x. So backward is not summing first and second derivatives for their own sake; it is summing the gradient contributions of the two outputs, as the chain rule requires.
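
A small sketch of that situation, again reusing MyCube from your post: when dx really does feed into the loss, grad_dx is non-zero and backward routes its contribution back to x.

import torch

x = torch.tensor(2.0, requires_grad=True)
result, dx = MyCube.apply(x)

loss = result + dx        # both outputs contribute to the loss
loss.backward()

# Gradient w.r.t. the input x: 3 * x ** 2 (through result) + 6 * x (through dx)
# = 12 + 12 = 24
print(x.grad)             # tensor(24.)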

Is it automatically assigned a value based on something else?

That is correct. needs_input_grad and other attributes on ctx are provided for you: autograd pre-populates them with information that may be useful during backward. In particular, needs_input_grad is a tuple of booleans, one per input to forward, telling you whether each input requires a gradient, so you can skip computing gradients that nobody needs.
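
For illustration, here is a toy function (ScaledAdd is made up, not from the tutorial) whose backward consults ctx.needs_input_grad to skip work for inputs that don’t need gradients; autograd fills in that tuple before calling backward.

import torch

class ScaledAdd(torch.autograd.Function):
    # z = a + 2 * b
    @staticmethod
    def forward(a, b):
        return a + 2 * b

    @staticmethod
    def setup_context(ctx, inputs, output):
        pass  # nothing to save; ctx.needs_input_grad is set by autograd

    @staticmethod
    def backward(ctx, grad_output):
        # One boolean per forward input, pre-populated before backward runs.
        grad_a = grad_output if ctx.needs_input_grad[0] else None
        grad_b = 2 * grad_output if ctx.needs_input_grad[1] else None
        return grad_a, grad_b

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(1.0)                 # does not require grad
ScaledAdd.apply(a, b).backward()
print(a.grad)                         # tensor(1.); no gradient computed for b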
