Should the input variable to a model require gradient?

I was going through some of my code and noticed that the inputs to my model had requires_grad set to False. I then tried the basic PyTorch example and found the same thing there. Here is the PyTorch example:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
input = torch.randn(1, 1, 32, 32)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()
# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)

When I print input.grad and input.requires_grad, the results are None and False respectively. I'm a bit confused: I thought gradients were supposed to accumulate in leaf variables, and that this could only happen if requires_grad = True.

You are thinking correctly! In your example, the input is a leaf tensor, but it has requires_grad=False, so no grad is accumulated for it, which is also what your code intends.
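To make the distinction concrete, here is a minimal sketch that inspects the autograd flags after a backward pass. The single nn.Linear layer and the tensor sizes are stand-ins chosen for brevity, not part of the original example. A tensor created directly with torch.randn reports is_leaf as True but requires_grad as False, so its .grad stays None, while the layer's weight gets a gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the model in the question: one linear layer.
layer = nn.Linear(4, 2)
# Created the same way as the input in the question (no requires_grad).
x = torch.randn(1, 4)

loss = layer(x).sum()
loss.backward()

# The input is a leaf, but autograd does not track it.
print("x.is_leaf:", x.is_leaf)
print("x.requires_grad:", x.requires_grad)
print("x.grad:", x.grad)

# The parameter is a leaf that requires grad, so a gradient was accumulated.
print("weight.is_leaf:", layer.weight.is_leaf)
print("weight.requires_grad:", layer.weight.requires_grad)
print("weight.grad shape:", tuple(layer.weight.grad.shape))
```

The optimizer only touches the tensors passed to it (here that would be layer.parameters()), so leaving the input untracked is the usual setup for ordinary training.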

For instance, the weights and biases of layers such as conv and linear are leaf variables that require grad, so when you call backward, grads are accumulated for them and the optimizer updates them. So if you also want gradients with respect to your inputs (which can be used to update the inputs), just like for the weights, you need to make the inputs leaf variables with requires_grad enabled.

For example, if you add the line below after input = torch.randn(1, 1, 32, 32) in your code, you can get the grads of the loss w.r.t. the input once loss.backward() has been called:

input = input.clone().detach().requires_grad_(True)
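As a rough end-to-end sketch of that suggestion, the snippet below enables gradients on the input, runs a backward pass, and reads the input gradient. The small nn.Sequential model, the tensor sizes, and the name inp are assumptions made to keep the example short; the same pattern should apply to the Net class above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small model standing in for Net from the question.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
criterion = nn.MSELoss()

inp = torch.randn(1, 8)
# Make the input a leaf tensor that autograd tracks.
inp = inp.clone().detach().requires_grad_(True)
target = torch.randn(1, 2)

loss = criterion(model(inp), target)
loss.backward()

# The gradient of the loss w.r.t. the input is now populated.
print("inp.requires_grad:", inp.requires_grad)
print("inp.is_leaf:", inp.is_leaf)
print("inp.grad shape:", tuple(inp.grad.shape))
```

Note that inp.grad is only filled in after backward has run; before that it is None even with requires_grad enabled.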


Awesome, I appreciate the help. That definitely clears things up.