Loss is 1 but gradients are zero

I am facing an issue where the gradients are 0 even though the loss is not zero: the loss stays at 1 while the gradients are 0. I’m using the MSE loss function. Can anyone please help me debug this?

Training code snippet:

# Train network
    max_epochs = max_epochs+1
    epoch = 1
    last_acc = 0
    while epoch < max_epochs:
        train_epoch_loss = 0
        accuracy = 0
        datalen = 0
        train_size = 0
        output = []
        target = []
        for batch_idx, (inps, tgts) in enumerate(train_loader):
            tgts = tgts.reshape((tgts.size(0), -1)).to(device)
            tgts = tgts.round()
            inps = inps.to(torch.float)
            outs = gcln(inps).to(device)
            gcln_ = copy.deepcopy(gcln)
            gcln_.cnf_layer_1.layer_or_weights = torch.nn.Parameter(
            gcln_.cnf_layer_1.layer_and_weights = torch.nn.Parameter(
            gcln_.cnf_layer_2.layer_or_weights = torch.nn.Parameter(
            gcln_.cnf_layer_2.layer_and_weights = torch.nn.Parameter(
            out_ = gcln_(inps)
            print("out_", out_.shape)
            if architecture == 1:
                t_loss = criterion(
                    outs, tgts[:, current_output].unsqueeze(-1))
                train_epoch_loss += t_loss.item()
            elif architecture == 2:
                t_loss = criterion(outs, tgts)
                train_epoch_loss += t_loss.item()
            elif architecture == 3:
                l = []
                for i in range(num_of_outputs):
                    l.append(criterion(outs[:, i], tgts[:, i]))
                t_loss = sum(l)
                train_epoch_loss += t_loss.item()/num_of_outputs
            train_size += outs.shape[0]

            if architecture == 1:
                              for e in out_.round().flatten().tolist()])
                target.append(tgts[:, current_output].tolist())
                accuracy += (out_.round().squeeze() ==
                             tgts[:, current_output]).sum()
            elif architecture > 1:
                              for e in out_.round().flatten().tolist()])
                accuracy += (out_.round() == tgts).sum()


To clarify your problem, please give us a code snippet.

I added the code snippet for the training loop.

I think the code is well written, even though it has some weird parts…
A loss value can get stuck while the gradients are non-zero, but the opposite case (non-zero loss with zero gradients) should not be possible if your model is properly defined.

Check every layer in your model for NaN or other invalid values.
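
One way to run this kind of check is to scan every parameter, and its gradient if one exists, for NaN or infinite values. This is only a minimal sketch: the helper name `find_non_finite` is hypothetical and not part of the original code, and the linear layer is a stand-in for the real model.

```python
import torch

def find_non_finite(model):
    """Return names of parameters (or their gradients) that contain NaN or Inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append(name)
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name + ".grad")
    return bad

# Toy usage: a small linear layer with one weight deliberately set to NaN.
layer = torch.nn.Linear(3, 2)
with torch.no_grad():
    layer.weight[0, 0] = float("nan")
print(find_non_finite(layer))
```

Calling the helper once after the forward pass and once after `backward()` covers both the weights and the gradients.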

Are you sure that these round() operations do not lead to all elements of weights being 0's?

I don’t think gcln_ and gcln are related.
The t_loss comes from gcln.

I see. You are right. Didn’t notice that:)
Also, there are many unknown things in the question.
What is the range of values in tgts? What about the network design, the final layer activation, etc.?

@thecho7 There are weird parts because the problem itself is unusual. I need the weights to be interpretable. Each weight ranges between 0 and 1 during training, but at the end I want each one to be either 0 or 1. After training, I have to read off a formula from the learned weights of the network.

Here’s the code for model:

class CNF_Netowrk(torch.nn.Module):
    def __init__(self, input_size, output_size, hidden_size, device) -> None:
        super().__init__()  # must run before any Parameter is assigned
        self.device = device
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # OR-layer gate weights, initialised uniformly in [0, 1]
        self.layer_or_weights = torch.nn.Parameter(
            torch.empty(
                self.input_size, self.hidden_size
            ).uniform_(0., 1.).to(dtype=torch.double).to(self.device)
        )

        # AND-layer gate weights, initialised uniformly in [0, 1]
        self.layer_and_weights = torch.nn.Parameter(
            torch.empty(
                self.hidden_size, self.output_size
            ).uniform_(0., 1.).to(dtype=torch.double).to(self.device)
        )

    def apply_gates(self, x, y):
        return torch.mul(x, y)

    def forward(self, inputs):
        with torch.no_grad():
            self.layer_or_weights.data.clamp_(0.0, 1.0)
            self.layer_and_weights.data.clamp_(0.0, 1.0)

        # gated_inputs.shape: batch_size x hidden_size
        gated_inputs = self.apply_gates(self.layer_or_weights, inputs)
        o = 1 - gated_inputs

        # or_res.shape: batch_size x K
        or_res = 1 - util.tnorm_n_inputs(o)
        or_res = or_res.unsqueeze(-1)

        # gated_or_res.shape: batch_size x K
        gated_or_res = self.apply_gates(self.layer_and_weights, or_res)
        gated_or_res = torch.add(
            gated_or_res, 1.0 - self.layer_and_weights, alpha=1)

        # out.shape: batch_size x output_size
        outs = util.tnorm_n_inputs(gated_or_res).unsqueeze(-1)

        return outs

Also, I checked the outputs of each layer, and there are no NaNs in them.

@InnovArul gcln_ is a deep copy of gcln whose weights are rounded to 0s and 1s, so that I can use it to get binary outputs to compare with tgts.
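
The difference between rounding inside the training path and rounding only a copy can be sketched as below. This is a toy illustration with a single hypothetical parameter `w`, not the actual gcln model: `round()` has a derivative of 0 almost everywhere, so it would zero the gradient if it sat between the parameter and the loss, whereas rounding a deep copy leaves the original parameter's gradient untouched.

```python
import copy
import torch

w = torch.nn.Parameter(torch.tensor([0.2, 0.7, 0.9], dtype=torch.double))

# Case 1: round() inside the differentiable path. Its derivative is 0
# almost everywhere, so the parameter receives an all-zero gradient.
loss_rounded = (w.round() - 1.0).pow(2).mean()
loss_rounded.backward()
grad_through_round = w.grad.clone()

# Case 2: round only a deep copy used for evaluation. The loss is computed
# from the original parameter, so its gradient is unaffected by the rounding.
w.grad = None
w_eval = copy.deepcopy(w)
with torch.no_grad():
    w_eval.round_()
loss = (w - 1.0).pow(2).mean()
loss.backward()
grad_without_round = w.grad.clone()

print(grad_through_round)  # all zeros
print(grad_without_round)  # non-zero entries
print(w_eval)              # rounded copy used only for evaluation
```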

tgts is a binary vector.

Maybe you’re cutting the computation graph somewhere in the forward pass.
I think you can check this by filling parameter.grad with a value other than zero,
then calling backward on the loss to see whether it changes at all.

Just a hypothesis. Can you plot the distribution of self.layer_and_weights for every training iteration?
Is there a chance that, after some iterations of training, self.layer_and_weights is going to 0?
Can you verify that?
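
A lightweight alternative to plotting is to record a few summary statistics of the weights at every iteration. The sketch below assumes a hypothetical helper `weight_summary` and a toy parameter; in the training loop it could be called on something like `gcln.cnf_layer_1.layer_and_weights` (attribute names taken from the posted snippet, assumed to exist on the model).

```python
import torch

def weight_summary(p, eps=1e-6):
    """Summarise a weight tensor: range, mean, and fraction of (near-)zero entries."""
    w = p.detach()
    return {
        "min": w.min().item(),
        "max": w.max().item(),
        "mean": w.mean().item(),
        "frac_zero": (w.abs() <= eps).double().mean().item(),
    }

# Toy parameter standing in for layer_and_weights.
w = torch.nn.Parameter(torch.tensor([[0.0, 0.5], [1.0, 0.0]], dtype=torch.double))

history = []  # append one summary per training iteration
history.append(weight_summary(w))
print(history[-1])
```

If `frac_zero` climbs towards 1 over the iterations, the weights are collapsing to 0; if the statistics simply stop changing, the weights are frozen rather than vanishing.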

No, the weights are not going to 0, but they stay constant after some time, which is because the gradient is 0.

@mMagmer Yes, I guess you are right. The gradients don’t change at all after doing what you suggested. But why doesn’t this happen with other examples? And how can I find out what is cutting the computation graph?
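
One way to tell a cut graph apart from a gradient that is merely zero is to look at `param.grad` after a single backward pass: a parameter that is not connected to the loss keeps `grad` as `None`, while a connected parameter whose gradient vanishes gets a tensor of zeros. The sketch below is an illustration under assumptions; `grad_report` and the `Toy` module are hypothetical and only mimic the three situations.

```python
import torch

def grad_report(model, loss):
    """Run one backward pass and classify each parameter's gradient."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    report = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            report[name] = "disconnected"       # graph is cut before this parameter
        elif torch.count_nonzero(p.grad) == 0:
            report[name] = "connected, zero gradient"
        else:
            report[name] = "ok"
    return report

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Parameter(torch.tensor([0.5, 0.2]))
        self.b = torch.nn.Parameter(torch.tensor([0.3, 0.7]))
        self.c = torch.nn.Parameter(torch.tensor([0.1, 0.9]))

    def forward(self, x):
        used = (self.a * x).sum()            # normal differentiable path
        cut = (self.b.detach() * x).sum()    # detach() cuts the graph
        flat = (self.c * 0.0).sum()          # connected, but the gradient is 0
        return used + cut + flat

model = Toy()
x = torch.tensor([1.0, 2.0])
loss = (model(x) - 1.0) ** 2
report = grad_report(model, loss)
print(report)
```

A "disconnected" entry points to something like `detach()`, `.data`, `.item()`, or a conversion to NumPy on the path; a "connected, zero gradient" entry points to the math of the forward pass instead.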

Now that I think about it, that test is not the right way to check the computation graph.
I want to say it’s in the util.tnorm_n_inputs part, but I’m not sure.

Here’s the implementation for tnorm_n_inputs. I’m using the product case. Do you think that is problematic?

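def tnorm_n_inputs(self, inp):
        """Fuzzy alternative for Logical AND"""

        if self.name == "godel":
            # Godel t-norm: minimum over the gated inputs
            out, _ = torch.min(inp, dim=-2)
            return out
        elif self.name == "product":
            # Product t-norm: product over the gated inputs
            return torch.prod(inp, -2)
        else:
            print("Wrong Name!")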

No problem in this case, but torch.min will cut the graph.

Yeah. I’m not using that anywhere.

I think I am wrong about this.
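
Both branches of tnorm_n_inputs can be checked directly with autograd. In this small sketch (toy input values, not the real network), `torch.min` along a dimension sends the gradient to the selected minimum entries, and `torch.prod` gives each entry the product of the other entries, so neither call cuts the graph.

```python
import torch

inp = torch.tensor([[0.2, 0.9],
                    [0.7, 0.4]], dtype=torch.double, requires_grad=True)

# Godel t-norm: minimum along dim=-2. The gradient flows to the selected entries.
out_min, _ = torch.min(inp, dim=-2)
(grad_min,) = torch.autograd.grad(out_min.sum(), inp)

# Product t-norm: product along dim=-2. Each entry's gradient is the
# product of the other entries in the same column.
out_prod = torch.prod(inp, -2)
(grad_prod,) = torch.autograd.grad(out_prod.sum(), inp)

print(grad_min)
print(grad_prod)
```

Note that the product gradient does shrink when the other factors are close to 0, which is a vanishing gradient rather than a cut graph.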

No worries. Thanks for helping.

@InnovArul can you tell what happens if layer_and_weights goes to zero after some iterations?
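
Going by the posted forward pass, the AND layer computes gated_or_res = layer_and_weights * or_res + (1 - layer_and_weights) followed by a product. The sketch below reproduces only that step with toy values and hypothetical variable names (`w` for layer_and_weights), under the assumption that the product t-norm is used. With `w` at 0 every gated value becomes 1, so the output is 1 regardless of the input and no gradient reaches `or_res`, and therefore none reaches the OR layer behind it.

```python
import torch

# Toy OR-layer output for one sample: hidden_size x 1
or_res = torch.tensor([[0.3], [0.8]], dtype=torch.double, requires_grad=True)

# AND-layer gate weights collapsed to zero: hidden_size x output_size
w = torch.zeros(2, 1, dtype=torch.double, requires_grad=True)

# Same gating as in the posted forward(): w * or_res + (1 - w)
gated = w * or_res + (1.0 - w)
out = torch.prod(gated, dim=-2)   # product t-norm over the hidden dimension
out.sum().backward()

print(out)          # constant 1, independent of or_res
print(or_res.grad)  # zeros: nothing flows back towards the OR layer
print(w.grad)       # or_res - 1 per entry, non-zero here
```

In this toy case `w` itself still receives a gradient, so this only shows how zero gates block the earlier layer; whether it explains the behaviour in the question would need to be verified on the real model.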