# Inherit from Function
class LinearFunction(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    # bias is an optional argument
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = ctx.saved_variables
        grad_input = grad_weight = grad_bias = None
        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)
        return grad_input, grad_weight, grad_bias

I am confused about why input in forward is a Tensor. How can its use be registered in the graph if the input is just a Tensor?

In the backward function, grad_output is a Variable whose requires_grad is False.
My custom autograd function is quite complex and involves some non-differentiable operations, so I perform some tricks to make it converge. I am wondering whether the following code is OK.

@staticmethod
def backward(ctx, grad_output):
    grad_output = grad_output.data
    # do something else to get the approximate grad_output
    grad_input = Variable(grad_output, requires_grad=False)
    return grad_input

The forward function does not need to work with Variables because you are defining the backward yourself.
It is the autograd engine that unpacks the Variable to give Tensors to the forward function.

The backward function, on the other hand, works with Variables (you may need to compute higher-order derivatives, so the graph of the computation needs to be created). If grad_output.requires_grad is False, it is because the .backward(...) or .grad(...) function was called with a gradient that does not require grad.
By default, the backward function should always work with Variables and create a proper graph (similarly to the forward function of an nn.Module).
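The point about higher-order derivatives is why backward must itself be recorded in a graph: a second derivative is obtained by differentiating the output of the first backward pass. A minimal sketch with built-in ops (create_graph=True is what makes the backward computation differentiable again):

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 3

# First derivative: create_graph=True records the backward
# computation so it can be differentiated a second time.
(g,) = torch.autograd.grad(y, x, create_graph=True)
print(g.item())   # 3 * x**2 = 12.0

# Second derivative: differentiate the first gradient.
(gg,) = torch.autograd.grad(g, x)
print(gg.item())  # 6 * x = 12.0
```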
If, in your case, you cannot create a proper graph, you need to add the @once_differentiable decorator to your backward function, on top of the @staticmethod. In that case, you work with Tensors and just need to output Tensors containing the gradients.
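For the "tricks to make it converge" use case, here is a hedged sketch of what a once_differentiable backward can look like on modern PyTorch (the decorator lives in torch.autograd.function). The op and its straight-through gradient are hypothetical examples, not the poster's actual function: sign() has zero gradient almost everywhere, so we substitute an approximate gradient, working with plain Tensors and giving up higher-order derivatives:

```python
import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable

class SignSTE(Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.sign()

    @staticmethod
    @once_differentiable
    def backward(ctx, grad_output):
        # grad_output is a plain Tensor here; no graph is built,
        # so only first-order derivatives are supported.
        input, = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where |input| <= 1.
        return grad_output * (input.abs() <= 1).to(grad_output.dtype)

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = SignSTE.apply(x)
y.sum().backward()
print(x.grad)  # tensor([0., 1., 1., 0.])
```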

I am still a bit confused. For normal autograd operations, we already return the corresponding gradient, such as grad_input, as a Variable. Why do we need to make sure that we have created a proper graph? We could directly assign grad_input to the variable input; it seems to have nothing to do with whether the graph is proper.

If your backward function does not have the once_differentiable decorator and does not create a proper graph, then all higher-order derivatives will be wrong. There are helper functions (torch.autograd.gradcheck and torch.autograd.gradgradcheck) if you want to check your first- and second-order derivative implementations with finite differences.
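A short sketch of how those checkers are called; the function being checked here is just an ordinary matmul for illustration, and double precision is used because the finite-difference comparison is numerically sensitive:

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

# gradcheck compares the analytic gradients of `fn` against finite
# differences; gradgradcheck does the same for second derivatives.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(5, 3, dtype=torch.double, requires_grad=True)

def fn(x, w):
    return x.mm(w.t())

ok1 = gradcheck(fn, (x, w))       # first-order check
ok2 = gradgradcheck(fn, (x, w))   # second-order check
print(ok1, ok2)  # True True
```

Both helpers raise an informative error (rather than returning False) when a gradient mismatch is found, so they are convenient to drop into a test suite.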