Hi, I’m reading the “Extending PyTorch” tutorial and have the following questions:
- In the document, it says:

> Tensor arguments that track history (i.e., with `requires_grad=True`) will be converted to ones that don’t track history before the call, and their use will be registered in the graph.
Does this happen implicitly, or do I need to do it manually? In the examples below, nothing disables gradient tracking when parameters are passed into `forward()`. However, further down it says:
> Inputs to `backward`, i.e., `grad_output`, can also be tensors that track history. So if `backward` is implemented with differentiable operations (e.g., invocation of another custom `Function`), higher-order derivatives will work.
This is confusing: if I need to disable tracking manually, why isn’t that done in the example? And if it is done implicitly, how can I re-enable tracking for a tensor? The probe sketch right after this item shows what I mean.
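To make the question concrete, here is a minimal probe I put together (the `Probe` class and its doubling op are placeholders I made up; the `forward`/`setup_context` split follows the style used in the tutorial). It just prints the tracking state that `forward()` and `backward()` actually see:

```python
import torch

class Probe(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # Report the grad state as seen from inside forward().
        print("inside forward: is_grad_enabled =", torch.is_grad_enabled(),
              "| x.requires_grad =", x.requires_grad)
        return 2 * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        pass  # nothing to save for this probe

    @staticmethod
    def backward(ctx, grad_output):
        # The docs say grad_output can itself track history (e.g. under
        # create_graph=True), so report what it looks like here too.
        print("inside backward: grad_output.requires_grad =",
              grad_output.requires_grad)
        return 2 * grad_output  # d(2x)/dx = 2

x = torch.randn(3, requires_grad=True)
y = Probe.apply(x)
y.sum().backward(create_graph=True)
```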
- There’s example code below whose intention I don’t fully understand:
```python
class MyCube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # We wish to save dx for backward. In order to do so, it must
        # be returned as an output.
        dx = 3 * x ** 2
        result = x ** 3
        return result, dx

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, = inputs
        result, dx = output
        ctx.save_for_backward(x, dx)

    @staticmethod
    def backward(ctx, grad_output, grad_dx):
        x, dx = ctx.saved_tensors
        # In order for the autograd.Function to work with higher-order
        # gradients, we must add the gradient contribution of `dx`,
        # which is grad_dx * 6 * x.
        result = grad_output * dx + grad_dx * 6 * x
        return result

# Wrap MyCube in a function so that it is clearer what the output is
def my_cube(x):
    result, dx = MyCube.apply(x)
    return result
```
Why does `backward` need to return the sum of the two terms, i.e. the first-derivative contribution `grad_output * dx` plus the term involving the second derivative, `grad_dx * 6 * x`? What does this mean? The snippet after this item shows how I’ve been exercising it.
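To check my understanding, this is how I’ve been exercising the example (it assumes the `MyCube` / `my_cube` definitions above). As far as I can tell, the second `torch.autograd.grad` call is what re-enters `backward` with a non-zero `grad_dx`:

```python
import torch

# Assumes MyCube and my_cube from the tutorial snippet above are defined.
x = torch.tensor(2.0, requires_grad=True)
y = my_cube(x)  # y = x ** 3

# First derivative: backward() runs with grad_output = 1 and grad_dx = 0,
# so it returns 1 * dx = 3 * x ** 2 = 12.
g, = torch.autograd.grad(y, x, create_graph=True)
print(g)   # tensor(12., grad_fn=...)

# Second derivative: differentiating g re-enters MyCube.backward through the
# saved dx output, this time with grad_output = 0 and grad_dx = 1,
# so it returns 1 * 6 * x = 12.
g2, = torch.autograd.grad(g, x)
print(g2)  # tensor(12.)
```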
- It seems the documentation isn’t written in a very detailed or easy-to-follow way (or maybe I missed something; if so, please explain and forgive me). Even though I’ve referred to many other sources, I still feel many components appear in the examples without being explained (or are explained as if they were obvious). For example: why is `needs_input_grad` used in the example’s `backward` even though it is never initialized anywhere? Is it automatically assigned a value based on something else? The small sketch after this item shows what I mean.
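From what I can tell (please correct me if this is wrong), `needs_input_grad` is something autograd fills in on `ctx` automatically: one boolean per input to `forward`, derived from each input’s `requires_grad` flag. The `Scale` class below is just something I made up to check this:

```python
import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(x, factor):
        return x * factor

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, factor = inputs
        ctx.save_for_backward(x, factor)

    @staticmethod
    def backward(ctx, grad_output):
        x, factor = ctx.saved_tensors
        # ctx.needs_input_grad has one bool per forward() input; we never
        # assign it ourselves, autograd sets it before calling backward().
        print(ctx.needs_input_grad)
        grad_x = grad_output * factor if ctx.needs_input_grad[0] else None
        # Summed because the scalar factor is broadcast across x.
        grad_factor = (grad_output * x).sum() if ctx.needs_input_grad[1] else None
        return grad_x, grad_factor

x = torch.randn(3, requires_grad=True)
factor = torch.tensor(2.0)               # does not require grad
Scale.apply(x, factor).sum().backward()  # prints (True, False)
```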
Which additional sources should I read to better understand these points?
Thanks