PyTorch dynamic graphs: unable to backpropagate

I am using dynamic graphs for conditional computation. Thus, for each image in a batch, the computational graph generated is different and depends on the actual value of the image. However, when I try to backpropagate through it, I get the following error:

"NoneType object has no attribute data "

Apparently the gradients do not exist; they seem to be empty. Is there a way to backpropagate through dynamic graphs? Any help would be very useful.

Are you trying to get .grad on a non-leaf variable?

Only Variables created with requires_grad=True get a .grad attribute. PyTorch hides the grads of all intermediate Variables.

Hi, I am confused: what do you mean by "leaf nodes"? Are these the nodes at the input layer (beginning), or at the end (last layer)? I am actually trying to get the .grad of the model parameters. Thanks

Inputs and parameters are leaf nodes, results of calculations are not.
For example, supposing

import torch
from torch.autograd import Variable

x = Variable(torch.rand(5))
weight = Variable(torch.rand(5), requires_grad=True)
y = x * weight
loss = y * 2  # stand-in for "do other calculations"
loss.sum().backward()

then

  • x.grad is None because x is a leaf node with requires_grad=False.
  • weight.grad is not None because weight is a leaf node with requires_grad=True.
  • y.grad is None because y is not a leaf node, it is the result of a calculation.

Hi,
So you are suggesting that, by default, every Variable has requires_grad=False?
Do I need to explicitly set requires_grad=True even for model parameters?
Actually, I tried that already by setting requires_grad=True for each model parameter, but the same problem persists.
I’m a bit confused now. Shouldn’t you be calling the grad function http://pytorch.org/docs/master/_modules/torch/autograd.html#grad in order to compute the gradients? I was wondering how you ended up with the gradients automatically.
Thanks a lot in advance!

Variables by default do not require grad.
Parameters by default do require grad.

I have edited my example for clarity. Generally, the grad function is only used in cases where the .backward() function doesn’t do quite what is needed.
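
To make those defaults concrete, here is a minimal sketch (just an illustrative check, not from the original thread):

import torch
import torch.nn as nn
from torch.autograd import Variable

v = Variable(torch.rand(3))
layer = nn.Linear(3, 3)

print(v.requires_grad)             # False: plain Variables do not require grad by default
print(layer.weight.requires_grad)  # True: nn.Parameters require grad by default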

Hi,
I think the following issue is happening. Suppose I have a network made of two sub-networks, call them NN1 and NN2.
Now, for a data point x, I either use
NN1(x) --> NN2(x) --> output
or
NN2(x) --> output
or
x --> output

Now, it turns out that not every parameter of NN1 and NN2 is used for each data point.
It turns out that when a neural network parameter is not used in the forward pass, its parameter.grad is None. This is the reason I am getting the error. It seems very hard to get rid of; I am stuck on this.
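
A minimal sketch of this behaviour (with hypothetical sub-networks net1 and net2 standing in for NN1 and NN2):

import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical sub-networks standing in for NN1 and NN2.
net1 = nn.Linear(5, 5)
net2 = nn.Linear(5, 1)

x = Variable(torch.rand(1, 5))
out = net2(x)          # this data point skips net1 entirely
out.sum().backward()

print(net2.weight.grad is None)  # False: net2 was used in the forward pass
print(net1.weight.grad is None)  # True: net1 was skipped, so its grad stays None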

I always did wonder why, in the source code of the optimizers, they loop over the given parameters and skip a parameter whose grad is None. For example, from the SGD source…

def step(self, closure=None):
    ...
    for group in self.param_groups:
        ...
        for p in group['params']:
            if p.grad is None:
                continue
            ...

Hi,
I’m able to circumvent this issue by sending a pseudo input through all the sub-networks and then multiplying the resulting output by zero. Since the zeroed output cannot affect the loss, its gradient contribution is 0, but every parameter now appears in the graph.
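
A minimal sketch of this workaround (hypothetical sub-networks net1 and net2, assumed shapes, just for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical sub-networks used for conditional computation.
net1 = nn.Linear(5, 5)
net2 = nn.Linear(5, 1)

x = Variable(torch.rand(1, 5))
real_loss = net2(x).sum()                  # only net2 is chosen for this input

# Pseudo input routed through every sub-network, scaled by zero so it
# cannot affect training but still puts every parameter into the graph.
dummy = Variable(torch.rand(1, 5))
dummy_loss = 0.0 * net2(net1(dummy)).sum()

(real_loss + dummy_loss).backward()
print(net1.weight.grad)   # a tensor of zeros instead of None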

Hi, thanks! But what you proposed won’t work for me. By default, a grad of None is fine for backprop on a single machine. However, I am doing distributed training and need to average the gradients over multiple workers. When the workers encounter None, they give an error.

So replace it with zeros.

for p in params:
    if p.grad is None:
        p.grad = Variable(p.data.new(p.size()).fill_(0))

Does NoneType have a size? I am wondering if it will give an error, because NoneType may not have that attribute.

My mistake. I have edited my code.

I guess it should be

 p.grad = Variable(p.data.new(p.size()).fill_(0))

Right again. I have edited my code again.

Hi, I had a very general question. In PyTorch, I have usually seen models defined with only a single forward function, as in this example taken from the PyTorch website:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

However, say that I want to define multiple forward functions, e.g. forward_1(self, x) and forward_2(self, x), each of which uses the same model parameters but builds a different computation graph.

Is it OK to do that? Example:

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward_1(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x

    def forward_2(self, x):
        # num_flat_features as defined in the example above
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Now say that I want to backprop once using forward_1 and once again using forward_2.
Can I do the following?

optimizer.zero_grad()
output = net.forward_1(x)
loss = criterion(output, target)
loss.backward()
optimizer.step()

optimizer.zero_grad()
output = net.forward_2(x)
loss = criterion(output, target)
loss.backward()
optimizer.step()

Is it OK to do that?

Yep.
As long as all the operations work with Variables, then backpropagation should work fine.
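
If it helps, a minimal self-contained sketch (with a hypothetical TwoPath module, not from the original posts) showing that two forward methods share the same parameters and that gradients flow through whichever one is called:

import torch
import torch.nn as nn
from torch.autograd import Variable

class TwoPath(nn.Module):
    def __init__(self):
        super(TwoPath, self).__init__()
        self.fc = nn.Linear(4, 4)   # shared by both forward methods

    def forward_1(self, x):
        return self.fc(x).sum()

    def forward_2(self, x):
        return torch.tanh(self.fc(x)).sum()

net = TwoPath()
x = Variable(torch.rand(2, 4))

net.zero_grad()
net.forward_1(x).backward()
print(net.fc.weight.grad is not None)   # True: gradients via forward_1

net.zero_grad()
net.forward_2(x).backward()
print(net.fc.weight.grad is not None)   # True: gradients via forward_2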