Weights are not updating? Does the graph break when the model is passed through another class?

class NeuraNetworkClass(nn.Module):  # --> This is the learnable neural network
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(.......)
        ..........
        ..........

    def forward(self, x):
        return self.fc1(x)


class NonNeuralNetworkClass(nn.Module):  # This is a non-learnable nn.Module class (no neural network layers of its own)
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, a, b):
        return self.model(a) + self.model(b)

model = NeuraNetworkClass()
net = NonNeuralNetworkClass(model)

optim = torch.optim.Adam(lr=1e-4, params=model.parameters())

for epoch in range(100):
    loss = loss_func(net(a=inp1, b=inp2), labels)
    optim.zero_grad()
    loss.backward()
    optim.step()

Problem: With the above code structure, the model parameters are not updating and their gradients are zero in every iteration.

Question: I have passed the model as an argument to the __init__ method of NonNeuralNetworkClass. Does the forward method of NonNeuralNetworkClass see the updated parameters of the model, or do we need to pass the model to the forward method again every time its parameters are updated?
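For reference, a minimal check (using a hypothetical Wrapper class standing in for NonNeuralNetworkClass) shows that the wrapper stores a reference to the same module object, so it always sees the current parameter values and the model does not need to be passed again after each update:

import torch
import torch.nn as nn

class Wrapper(nn.Module):  # hypothetical stand-in for NonNeuralNetworkClass
    def __init__(self, model):
        super().__init__()
        self.model = model  # stores a reference, not a copy

    def forward(self, a, b):
        return self.model(a) + self.model(b)

model = nn.Linear(1, 1)
net = Wrapper(model)

print(net.model is model)                # True: same module object
print(net.model.weight is model.weight)  # True: same parameter tensor

# An in-place optimizer step on model.parameters() is therefore visible
# through net.model as well.
optim = torch.optim.Adam(model.parameters(), lr=1e-1)
loss = net(torch.randn(1, 1), torch.randn(1, 1)).pow(2).mean()
loss.backward()
optim.step()
print(torch.equal(net.model.weight, model.weight))  # True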

Your approach should work as seen here:

import torch
import torch.nn as nn

class NeuraNetworkClass(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 1)

    def forward(self, x):
        return self.fc1(x)


class NonNeuralNetworkClass(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, a, b):
        return self.model(a) + self.model(b)

model = NeuraNetworkClass()
net = NonNeuralNetworkClass(model)

optim = torch.optim.Adam(lr=1e-4, params=model.parameters())

loss_func = nn.MSELoss()
for epoch in range(10):
    inp1 = torch.randn(1, 1)
    inp2 = torch.randn(1, 1)
    labels = torch.randn(1, 1)
    loss = loss_func(net(a=inp1, b=inp2), labels)
    optim.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        print(name, param.grad)
    optim.step()

So a zero gradient could indicate a vanishing gradient issue.
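As an illustration of how a gradient can come out exactly zero even though the graph is intact (a toy sketch, not taken from the original code): a tanh driven deep into saturation has a numerically zero derivative in float32, so the gradient for the preceding layer vanishes.

import torch
import torch.nn as nn

# Toy example: force the pre-activation into tanh's saturation region.
layer = nn.Linear(1, 1)
with torch.no_grad():
    layer.weight.fill_(100.0)
    layer.bias.zero_()

x = torch.ones(1, 1)
out = torch.tanh(layer(x))      # numerically exactly 1.0 in float32
loss = (out - 0.0).pow(2).mean()
loss.backward()

# d(tanh)/dx = 1 - tanh(x)**2 == 0.0 here, so the gradient is exactly zero
# even though the computation graph is fully connected.
print(layer.weight.grad)        # tensor([[0.]])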


Thanks for the response!

The weights of the model are never updated; they stay the same as their initialized values. And the gradients are zero even in the first iteration. Can this still be a vanishing gradient problem?

It might be the case, but you should check if the computation graph is detached somewhere in code that wasn't shown (it's not detached in my example).
To do so, create the model and make sure that all .grad attributes are None via:

for name, param in model.named_parameters():
    print(name, param.grad)

Skip the optimizer.zero_grad() call, create the model output and the loss, and call loss.backward().
Afterwards check the .grad attributes again.
If they show zero values, the gradients were properly calculated in the backward() op but are zero, e.g. due to the model architecture.
If they are still None, the computation graph would be incorrect (assuming you expect all parameters to get a valid gradient).
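Putting the suggested check together as code (a sketch assuming the toy NeuraNetworkClass / NonNeuralNetworkClass from the example above; the loss and inputs are random placeholders):

import torch
import torch.nn as nn

model = NeuraNetworkClass()
net = NonNeuralNetworkClass(model)

# 1) Freshly created model: all .grad attributes should be None.
for name, param in model.named_parameters():
    print(name, param.grad)  # expected: None

# 2) No optimizer.zero_grad() call on purpose; run one forward + backward pass.
out = net(a=torch.randn(1, 1), b=torch.randn(1, 1))
loss = nn.MSELoss()(out, torch.randn(1, 1))
loss.backward()

# 3) Check the .grad attributes again:
#    - zero values -> gradients were computed but are zero (e.g. architecture)
#    - still None  -> the computation graph is detached somewhere
for name, param in model.named_parameters():
    print(name, param.grad)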

I tried this and got params.grad as None before loss.backward() and zero after backward() in the first iteration. In the following iterations params.grad is zero both before and after backward() as well; I think this is because the gradients are already zero in the first iteration, so the parameters never update.

To give a bit of an idea about the architecture: I am trying to use the model to predict the parameters of another network (let's say InferenceModel()). The model consists of 60 neural networks, each with 3 layers (input, hidden, output). These 60 networks predict (regress) 60 different parameters of InferenceModel(). The inference model takes images as input and outputs images, so the loss is the L2 distance between the ground-truth image and the inferred image. I backpropagate this loss, computed on the output of InferenceModel(), to update the model's parameters (see the sketch after the architecture printout below).

The model architecture is shown below for the first two of the 60 networks. The same NN block is repeated for the remaining 58 networks as well.

(1): FCBlock(
      (net): Sequential(
        (0): FCLayer(
          (net): Sequential(
            (0): Linear(in_features=512, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FCLayer(
          (net): Sequential(
            (0): Linear(in_features=128, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FCLayer(
          (net): Sequential(
            (0): Linear(in_features=128, out_features=262144, bias=True)
            (1): Tanh()
          )
        )
      )
    )
    (2): FCBlock(
      (net): Sequential(
        (0): FCLayer(
          (net): Sequential(
            (0): Linear(in_features=512, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (1): FCLayer(
          (net): Sequential(
            (0): Linear(in_features=128, out_features=128, bias=True)
            (1): LeakyReLU(negative_slope=0.01, inplace=True)
          )
        )
        (2): FCLayer(
          (net): Sequential(
            (0): Linear(in_features=128, out_features=512, bias=True)
            (1): Tanh()
          )
        )
      )
    )

FCBlock code
FCLayer code
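For context, here is a minimal, hypothetical sketch of the kind of setup described above (the names ParamPredictor and run_inference are illustrative, not the actual code): a small predictor regresses a weight tensor which is then used via the functional API, so the loss on the inferred output can backpropagate into the predictor. If the predicted values were instead copied into another module's parameters (e.g. via .data or by wrapping them in a new nn.Parameter), the graph would be cut at that point.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParamPredictor(nn.Module):
    # Hypothetical stand-in for one of the 60 FCBlocks.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 128), nn.LeakyReLU(0.01, inplace=True),
            nn.Linear(128, 128), nn.LeakyReLU(0.01, inplace=True),
            nn.Linear(128, 16), nn.Tanh(),  # e.g. predicts a 4x4 weight
        )

    def forward(self, z):
        return self.net(z).view(4, 4)

def run_inference(image, weight):
    # Use the predicted weight via the functional API so gradients can flow
    # back into the predictor; assigning it to a module's .data would detach it.
    return F.linear(image, weight)

predictor = ParamPredictor()
optim = torch.optim.Adam(predictor.parameters(), lr=1e-4)

z = torch.randn(1, 512)      # conditioning input for the predictor
image = torch.randn(1, 4)    # toy "input image"
target = torch.randn(1, 4)   # toy "ground-truth image"

weight = predictor(z)
loss = F.mse_loss(run_inference(image, weight), target)
optim.zero_grad()
loss.backward()
print(predictor.net[0].weight.grad.abs().sum())  # non-zero if the graph is intact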

@ptrblck I do not see anything wrong with my network, but the gradients are still zero. Could you please help me debug this? I can share the complete code if that makes it easier for you to suggest something.

Sure, I can take a look at your code if you could post a minimal and executable code snippet (using random input data) to reproduce this issue.