Why is `input` a tensor in the forward function when extending torch.autograd?

from torch.autograd import Function

# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and are only there to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)

        return grad_input, grad_weight, grad_bias
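
For reference, a minimal sketch of how this Function gets used (the shapes are arbitrary, and this assumes a current PyTorch where apply can be called directly on tensors that require grad):

import torch

x = torch.randn(4, 3, requires_grad=True)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(5, requires_grad=True)

# forward receives the underlying tensors; autograd records the call
out = LinearFunction.apply(x, w, b)
out.sum().backward()  # invokes LinearFunction.backward with grad_output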
  1. I am confused about why input in forward is a tensor. How can its use be registered in the graph if the input is just a tensor?
  2. In the backward function, grad_output is a Variable whose requires_grad is False.
    My custom autograd function is quite complex and needs some non-differentiable operations, so I perform some tricks to make it converge. I am wondering whether the following code is OK.
@staticmethod
def backward(ctx, grad_output):
    grad_output = grad_output.data
    # do something else to get the approximate grad_output
    grad_input = Variable(grad_output, requires_grad=False)
    return grad_input

Hi,

  1. The forward function does not need to work with Variables because you are defining the backward yourself.
    It is the autograd engine that unpacks the Variable to give Tensors to the forward function.
  2. The backward function, on the other hand, works with Variables (you may need to compute higher-order derivatives, so the graph of computation needs to be created). If grad_output.requires_grad is False, it is because .backward(...) or .grad(...) was called with a gradient that does not require grad.
    By default, the backward function should always work with Variables and create a proper graph (similarly to the forward function of an nn.Module).
    If, in your case, you cannot create a proper graph, you need to add the @oncedifferentiable decorator to your backward function on top of the @staticmethod. In this case, you work with (EDIT:) Tensors and just need to output Tensors containing the gradients.

I did as you said. However, it gives the error message: NameError: name 'oncedifferentiable' is not defined.


Sorry, small typo: it is once_differentiable, which you can import with from torch.autograd.function import once_differentiable.

I added @once_differentiable on top of @staticmethod, and the error TypeError: 'staticmethod' object is not callable occurs.

Swap the two decorators.
You can see here how it is used.
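
For instance, a minimal sketch of the ordering (MyDouble is just a placeholder Function; put your real gradient computation in the backward body):

from torch.autograd import Function
from torch.autograd.function import once_differentiable

class MyDouble(Function):
    @staticmethod
    def forward(ctx, input):
        return input * 2

    @staticmethod            # @staticmethod stays on top
    @once_differentiable     # once_differentiable goes directly above backward
    def backward(ctx, grad_output):
        # grad_output arrives as a plain Tensor here; return Tensors as well
        return grad_output * 2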


Weird, when I do this, the grad_output parameter becomes a Tensor instead of a Variable whose requires_grad is False.

Oh, I forgot it does the unpacking for you, so you don't need to deal with Variables. My bad. I edited my answer above.

If grad_input has been turned into a Tensor in my custom layer, will it be transformed back into a Variable when it gets to the previous layer?

Yes, since you got a Tensor as input, you should return a Tensor.

I am still a bit confused. For normal autograd operations, we already return the corresponding gradient, such as grad_input, as a Variable. Why do we need to make sure that we have created a proper graph? We can directly assign grad_input to the variable input; it seems this has nothing to do with whether the graph is proper.

If your backward function does not have the once_differentiable decorator and does not create a proper graph in the backward function, then all higher-order derivatives will be wrong. There are helper functions here and here if you want to check your first- and second-order derivative implementations with finite differences.
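
For example, a quick check could look like this (a sketch, assuming the LinearFunction above and double-precision inputs, which the finite-difference comparison needs):

import torch
from torch.autograd import gradcheck, gradgradcheck

x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(5, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(5, dtype=torch.double, requires_grad=True)

# compares the analytical backward against finite differences
print(gradcheck(LinearFunction.apply, (x, w, b)))
# second-order check; only meaningful when backward builds a proper graph
print(gradgradcheck(LinearFunction.apply, (x, w, b)))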


Thanks albanD! You really helped me!