PyTorch dynamic graphs: unable to backpropagate

I am using dynamic graphs for conditional computation. Thus, for each image in a batch, the computational graph generated is different and depends on the actual value of the image. However, when I try to backpropagate through it, I get the following error:

"NoneType object has no attribute data "

Apparently the gradients do not exist; they seem to be empty. Is there a way to backpropagate through dynamic graphs? Any help would be very useful.

Are you trying to get .grad on a non-leaf variable?

Only Variables created with requires_grad=True get a .grad attribute. PyTorch hides the grads of all intermediate Variables.

Hi, I am confused: what do you mean by "leaf nodes"? Are these the nodes at the input layer (beginning), or at the end (last layer)? I am actually trying to get the .grad of the model parameters. Thanks

Inputs and parameters are leaf nodes, results of calculations are not.
For example, supposing

import torch
from torch.autograd import Variable

x = Variable(torch.rand(5))
weight = Variable(torch.rand(5), requires_grad=True)
y = x * weight
loss = y * 2  # stand-in for "do other calculations"
loss.sum().backward()

then

  • x.grad is None because x is a leaf node with requires_grad=False.
  • weight.grad is not None because weight is a leaf node with requires_grad=True.
  • y.grad is None because y is not a leaf node, it is the result of a calculation.

Hi,
So you are suggesting that, by default, every Variable has requires_grad=False?
Do I need to explicitly set requires_grad=True even for model parameters?
Actually, I tried that already by setting requires_grad=True for each model parameter, but the same problem persists.
I’m a bit confused now. Shouldn’t you be calling the grad function http://pytorch.org/docs/master/_modules/torch/autograd.html#grad in order to compute the gradients? I was wondering how you ended up with the gradients automatically.
Thanks a lot in advance!

Variables by default do not require grad.
Parameters by default do require grad.

I have edited my example for clarity. Generally, the grad function is only used in cases where the .backward() function doesn’t do quite what is needed.
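
To make those defaults concrete, here is a minimal sketch (just an illustrative check, not from the original thread):

import torch
import torch.nn as nn
from torch.autograd import Variable

v = Variable(torch.rand(3))
layer = nn.Linear(3, 3)

print(v.requires_grad)             # False: plain Variables do not require grad by default
print(layer.weight.requires_grad)  # True: nn.Parameters require grad by default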

Hi,
I think the following issue is happening. Suppose I have a network made of two sub-networks, call them NN1 and NN2.
Now, for a data point x, I either use
NN1(x) --> NN2(x) --> output
or
NN2(x) --> output
or
x --> output

Now, it turns out that not every parameter of NN1 and NN2 is used for each data point.
It turns out that when a neural network parameter is not used in the forward pass, its parameter.grad is None. This is the reason I am getting the error. It seems very hard to get rid of; I am stuck on this.
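
A minimal sketch of this behaviour (with hypothetical sub-networks net1 and net2 standing in for NN1 and NN2):

import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical sub-networks standing in for NN1 and NN2.
net1 = nn.Linear(5, 5)
net2 = nn.Linear(5, 1)

x = Variable(torch.rand(1, 5))
out = net2(x)          # this data point skips net1 entirely
out.sum().backward()

print(net2.weight.grad is None)  # False: net2 was used in the forward pass
print(net1.weight.grad is None)  # True: net1 was skipped, so its grad stays None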

I always did wonder why, in the source code of the optimizers, they loop over the given parameters and skip a parameter whose grad is None. For example, from the SGD source…

def step(self, closure=None):
    ...
    for group in self.param_groups:
        ...
        for p in group['params']:
            if p.grad is None:
                continue
            ...

Hi,
I’m able to circumvent this issue by sending a pseudo input through all the sub-networks and then multiplying the resulting output by zero. Since the zeroed output cannot affect the loss, its gradient contribution is 0, but every parameter now appears in the graph.
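
A minimal sketch of this workaround (hypothetical sub-networks net1 and net2, assumed shapes, just for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical sub-networks used for conditional computation.
net1 = nn.Linear(5, 5)
net2 = nn.Linear(5, 1)

x = Variable(torch.rand(1, 5))
real_loss = net2(x).sum()                  # only net2 is chosen for this input

# Pseudo input routed through every sub-network, scaled by zero so it
# cannot affect training but still puts every parameter into the graph.
dummy = Variable(torch.rand(1, 5))
dummy_loss = 0.0 * net2(net1(dummy)).sum()

(real_loss + dummy_loss).backward()
print(net1.weight.grad)   # a tensor of zeros instead of None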

Hi, thanks! But what you proposed won’t work for me. By default, a grad of None is fine for backprop on a single machine. However, I am doing distributed training and need to average the gradients over multiple workers. When the workers encounter None, they give an error.

So replace it with zeros.

for p in params:
    if p.grad is None:
        p.grad = Variable(p.data.new(p.size()).fill_(0))

Does NoneType have a size? I am wondering if it will give an error, because NoneType may not have that attribute.

My mistake. I have edited my code.

I guess it should be

 p.grad = Variable(p.data.new(p.size()).fill_(0))

Right again. I have edited my code again.

Hi, I had a very general question. In PyTorch, I have usually seen models defined with only a single forward function, as in this example taken from the PyTorch website:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

However, say that I want to define multiple forward functions, e.g. forward_1(self, x) and forward_2(self, x), each of which uses the same model parameters but builds a different computation graph.

Is it OK to do that? Example:

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward_1(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x

    def forward_2(self, x):
        # num_flat_features as defined in the example above
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Now say that I want to backprop once using forward_1 and once again using forward_2.
Can I do the following?

optimizer.zero_grad()
output = net.forward_1(x)
loss = criterion(output, target)
loss.backward()
optimizer.step()

optimizer.zero_grad()
output = net.forward_2(x)
loss = criterion(output, target)
loss.backward()
optimizer.step()

Is it OK to do that?

Yep.
As long as all the operations work with Variables, then backpropagation should work fine.
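
If it helps, a minimal self-contained sketch (with a hypothetical TwoPath module, not from the original posts) showing that two forward methods share the same parameters and that gradients flow through whichever one is called:

import torch
import torch.nn as nn
from torch.autograd import Variable

class TwoPath(nn.Module):
    def __init__(self):
        super(TwoPath, self).__init__()
        self.fc = nn.Linear(4, 4)   # shared by both forward methods

    def forward_1(self, x):
        return self.fc(x).sum()

    def forward_2(self, x):
        return torch.tanh(self.fc(x)).sum()

net = TwoPath()
x = Variable(torch.rand(2, 4))

net.zero_grad()
net.forward_1(x).backward()
print(net.fc.weight.grad is not None)   # True: gradients via forward_1

net.zero_grad()
net.forward_2(x).backward()
print(net.fc.weight.grad is not None)   # True: gradients via forward_2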