Why didn't "loss.backward()" update the parameters' gradients?

Hi, I ran into a problem with gradient updates while training my network. I built a CNN with “two” weights per layer: the original float weight “self.weight” and a binarized one, “self.Bi_weight”. I created both the same way:

    if transposed:
        # If transposed, [in, out, [kernel size]]
        self.weight = Parameter(torch.Tensor(
            in_channels, out_channels // groups, *kernel_size))
        self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda()
    else:
        # If not, [out, in, [kernel size]]
        self.weight = Parameter(torch.Tensor(
            out_channels, in_channels // groups, *kernel_size))
        self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda()

and initialized them from a uniform distribution. However, after computing the gradient with “loss.backward()”, I found that “self.Bi_weight.grad” had not changed (it was still None) and the model did not train.

I am still working on it but have not figured out why the gradient isn’t updated as usual. Any hint would be sincerely appreciated!!

Thank you!~


self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda()

The .cuda() call is a computation: it returns a new tensor, so self.Bi_weight ends up holding the result of that operation and not a Parameter.
Use self.Bi_weight = Parameter(torch.Tensor(self.weight.shape).cuda()) or, better yet, drop the .cuda() here and call model.cuda() at the end.
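
To see why the original line loses the Parameter, here is a minimal sketch that runs without a GPU. It assumes a dtype conversion with .to(torch.float64) as a stand-in for .cuda(), since both return a new tensor computed from the Parameter; the Demo module and its attribute names are made up for illustration.

```python
import torch
from torch import nn


class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered: the attribute really is an nn.Parameter.
        self.good = nn.Parameter(torch.ones(3))
        # Not registered: .to(...) (used here as a CPU stand-in for .cuda())
        # is an operation on the Parameter and returns a plain, non-leaf tensor.
        self.bad = nn.Parameter(torch.ones(3)).to(torch.float64)


demo = Demo()
registered = [name for name, _ in demo.named_parameters()]
print(registered)                           # only "good" is listed
print(isinstance(demo.bad, nn.Parameter))   # False
print(demo.bad.is_leaf)                     # False, so backward() leaves .grad as None
```

Because "bad" is neither a leaf nor a registered parameter, an optimizer built from demo.parameters() would never see it.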

Oh, thank you for your reply! I used to think “.cuda()” just stores the parameter on the GPU, and now I see you are right!

I modified my code to leave out the “.cuda()” and added “model.cuda()” before training. I think the problem mentioned above has been solved, but another one has come up. Do you have any idea what this means?

Traceback (most recent call last):
  File "cifar_bi.py", line 370, in <module>
  File "cifar_bi.py", line 203, in main
    train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
  File "cifar_bi.py", line 267, in train
  File "/home/xcz/.local/lib/python3.5/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xcz/.local/lib/python3.5/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: leaf variable has been moved into the graph interior

By the way, I checked the “requires_grad” attribute of weight and Bi_weight, and both are “True”. I am confused about what “leaf variable” means.

A leaf variable is one that you create explicitly (with constructors/factory functions, as a Parameter, etc.), so that it has no predecessors in the sense that it is not computed. requires_grad can mean two things:

  • Either it is a leaf and you set requires_grad explicitly (or via making it a Parameter),
  • or it has been computed from things requiring gradient.

Your error could be that you assign to or modify a leaf variable in place where you should not. This is usually hard to diagnose without looking at the code.
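
As a small illustration of the leaf/non-leaf distinction and of the in-place restriction, here is a sketch with made-up tensors w and y. The exact error message varies between PyTorch versions, so the code only records that an error was raised.

```python
import torch

w = torch.randn(3, requires_grad=True)   # created directly: a leaf
y = w * 2                                # computed from w: not a leaf

print(w.is_leaf, w.requires_grad)   # True True
print(y.is_leaf, y.requires_grad)   # False True

# Modifying a leaf that requires grad in place is rejected by autograd.
try:
    w.add_(1.0)
    inplace_error = None
except RuntimeError as err:
    inplace_error = str(err)
print(inplace_error)

# The same update is allowed when autograd recording is switched off.
with torch.no_grad():
    w.add_(1.0)
print(w.is_leaf)   # still a leaf
```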

Thank you!

I think you really hit the point. I reviewed my code just now and I think my mistake is this:

I create two Parameters, weight and Bi_weight, in one conv layer, and during forward propagation the data of “Bi_weight” is computed from the data of “weight”, like this:

self.Bi_weight.data[channal] = Binary(self.weight.data[channal] )

When I create the two Parameters, “requires_grad” is “True” for both, so during backpropagation weight’s gradient is computed from Bi_weight’s, and the latter requires a gradient.

Am I right about why my code doesn’t work? I am also wondering what I should do if, to fix the bug, I want loss.backward() not to compute the gradient of weight, and then have “optimizer.step()” update weight using Bi_weight’s gradient. I tried setting weight.requires_grad to False and updating with this code:

    for layer in model.modules():
        if isinstance(layer, nn.Conv2d_Bi):
            # print("shot:", layer.Bi_weight.requires_grad, layer.Bi_weight.grad)
            layer.weight.grad = copy.deepcopy(layer.Bi_weight.grad)

But that failed: the optimizer can’t update a parameter that does not require grad.

So I’m not exactly sure I understand what you are aiming at, but to me it looks unusual to have two parameters “for the same thing”.
I don’t know anything about weight quantization and haven’t read any of the literature on it, but if you want to quantize the weight and keep the unquantized weight around for training, my initial approach would be not to make the quantized weight a parameter (it doesn’t make much sense to have a gradient if you aim at discrete values), but rather to have the unquantized (“raw”) weight as a parameter, i.e. something like

self.weight_raw = Parameter(something)

in the setup and then

quantized_weight = self.weight_raw + (torch.round(self.weight_raw)-self.weight_raw).detach()

(using torch.round as the quantization function). Then do the calculation with quantized_weight.

This way, you will do the forward with the quantized weight, but get the gradient of the unquantized weight for updates and also update that during training.
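
Here is a minimal sketch of that trick (commonly called a straight-through estimator) on a toy tensor. The values in weight_raw and the per-element coefficients are made up for illustration; the point is only that the forward value is rounded while the gradient reaches weight_raw as if no rounding had happened.

```python
import torch
from torch import nn

weight_raw = nn.Parameter(torch.tensor([0.2, 0.7, -1.4]))

# Forward value equals round(weight_raw); the detached difference is treated
# as a constant, so d(quantized_weight)/d(weight_raw) is the identity.
quantized_weight = weight_raw + (torch.round(weight_raw) - weight_raw).detach()

coeffs = torch.tensor([1.0, 2.0, 3.0])   # hypothetical downstream computation
loss = (quantized_weight * coeffs).sum()
loss.backward()

print(quantized_weight.detach())   # rounded values used in the forward pass
print(weight_raw.grad)             # equals coeffs: the gradient passed straight through
```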

Admittedly, this could be totally off because I misunderstood what you want to do.

(If it is useful, here is the credit: I learnt that trick when @hughperkins shared it in the context of sampling (gumbel softmax). I used that for shake-shake networks.)

@tom thank you for crediting me with citing the idea :) Note for completeness that the original source of the idea, AFAIK, is Eric Jang, https://github.com/ericjang/gumbel-softmax/blob/3c8584924603869e90ca74ac20a6a03d99a91ef9/Categorical%20VAE.ipynb

As a somewhat-relevant aside, interestingly, OpenAI used the .detach() operator as an actual mathematical symbol in their paper “DiCE: the infinitely differentiable Monte-Carlo estimator”, https://arxiv.org/abs/1802.05098 , see the second unlabelled equation on page 2.

Oh, I’m sorry Tom, it’s my fault. It is not my invention; in fact, I am trying to implement the “XNOR network” in PyTorch. The inventor of that network wants to make the network, especially its conv layers, smaller to store and faster in the forward pass.

His method is to add another “weight”, a binarized one W’, to each conv layer. That is, each conv layer has two weights: the normal one W that we know from other state-of-the-art networks, and W’:

W’ = sign(W)*mean(abs(W))

In the forward pass, the author replaces

conv_output = activation_function(conv(conv_input, W))

with his new idea:

W’ = sign(W) * mean(abs(W))
conv_output = activation_function(conv(conv_input, W’))

In the backward pass, the gradient of W would be NaN because of non-differentiable functions like “sign()”, so the author uses the gradient of W’ to update W:

W := W - learning_rate * W’.grad()

And that is where I am stuck. I can get “W’.grad()” computed and (it seems) the gradient is right, but I don’t know how to update W. I have tried:

    self.weight = Parameter(torch.Tensor(
        out_channels, in_channels // groups, *kernel_size), requires_grad=True).cuda()
    self.Bi_weight = Parameter(torch.Tensor(self.weight.shape))

    for layer in model.modules():
        if isinstance(layer, nn.Conv2d_Bi):
            layer.weight.grad = copy.deepcopy(layer.Bi_weight.grad)

followed by “optimizer.step()”, but it failed. I am trying other things. Do you have any advice or thoughts?

One quick comment:

When you do this in PyTorch (or in general), the gradient of sign will mostly be 0: the sign function has derivative 0 everywhere except at 0. We can try this:

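import torch
a = torch.randn(5, requires_grad=True)
b = a.abs().mean()*(torch.sign(a))
b.retain_grad()     # b is not a leaf, so its .grad is only kept on request
b.sum().backward()  # without a backward pass both .grad attributes stay None
print ("b.grad", b.grad)
print ("a.grad", a.grad)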

will give

b.grad tensor([ 1.,  1.,  1.,  1.,  1.])
a.grad tensor([ 0.2000,  0.2000, -0.2000, -0.2000,  0.2000])

or so, which looks about right (depending on the ratio of positive to negative numbers in the random sample, the 0.2 (3-2) becomes 0.6 (4-1) or 1.0 (5-0)).

I’m not sure I understand the update rule – maybe you want sign(W) in there?

But you could do W_prime.retain_grad() as above (again, I would not make W’ a parameter) and then just do

with torch.no_grad():
    W.add_(torch.sign(W)*W_prime.grad, alpha=-lr)

or so.
Would that work as expected?
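
For reference, here is a self-contained sketch of that idea on a toy linear "layer". It follows the update rule quoted in the question (W := W - learning_rate * W’.grad) and leaves out the optional sign(W) factor; the shapes, the input x, the squared-output loss and the value of lr are all assumptions made up for illustration, not the XNOR-Net training recipe.

```python
import torch

torch.manual_seed(0)
lr = 0.1
W = torch.randn(4, 3, requires_grad=True)   # real-valued weight (a leaf)
x = torch.randn(8, 3)                       # hypothetical input batch

# Binarized weight: computed from W, so it is not a Parameter.
W_prime = torch.sign(W) * W.abs().mean()
W_prime.retain_grad()                       # keep the gradient of this non-leaf tensor

loss = (x @ W_prime.t()).pow(2).mean()      # forward pass uses W_prime only
loss.backward()

W_before = W.detach().clone()
with torch.no_grad():
    # Manual step: W := W - lr * W_prime.grad
    W.add_(W_prime.grad, alpha=-lr)

print(W_prime.grad.shape)                   # same shape as W
print((W - W_before).abs().max().item())    # size of the largest update
```

Since the update happens under torch.no_grad(), W stays a leaf and can be reused in the next iteration (remember to recompute W_prime each time).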

I am trying to add a second, different loss function to the original one, but it causes no change in the weight updates. Changing the second loss function makes no difference either. Do you think there is anything wrong? I am running the code on GPU. The first loss is nn.BCELoss() and the second is L1. The result is the same as when using only BCELoss; L1 or other losses have no effect on the results.

output = netD(fake).view(-1)

# Calculate G's loss based on this output
errG1 = criterion(output, label)

xxx = torch.histc(GaussyMask.squeeze(1).view(-1).cpu(), 100, min=0, max=1, out=None)
xxx1 = torch.histc(fake.squeeze(1).view(-1).cpu(), 100, min=0, max=1, out=None)


# Calculate gradients for G adding two losses

D_G_z2 = output.mean().item()
# Update G
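
One general way to check whether a second loss term can influence the weights at all is to look at its requires_grad and grad_fn, and to compare gradients with and without it. The sketch below uses made-up tensors and losses; the detached term is only a stand-in for any operation that cuts the autograd graph, and it is an assumption that something similar happens in the snippet above.

```python
import torch

w = torch.randn(4, requires_grad=True)


def grad_of(loss_fn):
    """Return d(loss)/dw for a freshly computed loss."""
    pred = w * 2.0
    (grad,) = torch.autograd.grad(loss_fn(pred), w)
    return grad


def loss1(pred):
    return pred.pow(2).mean()


def loss2(pred):
    # .detach() (like .item() or a non-differentiable op) cuts the graph here.
    return (pred.detach() - 1.0).abs().mean()


probe = loss2(w * 2.0)
print(probe.requires_grad, probe.grad_fn)   # False None: nothing to backpropagate

grad_first_only = grad_of(loss1)
grad_both = grad_of(lambda pred: loss1(pred) + loss2(pred))
print(torch.allclose(grad_first_only, grad_both))   # the second term adds no gradient
```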

I am getting the same problem: the parameters aren’t updating with my custom loss function. After running .backward(), I found that the problem shows up after optimizer.step(). I compared the model before and after, and it gave the same values:


for epoch in range(EPOCHS):
    before = list(MODEL.parameters())[0].clone()
    for batch in tqdm(valid_loader):
        q1 = batch[0].to(device)
        q2 = batch[1].to(device)
        q1_vec, q2_vec = MODEL(q2, q1)

        loss = CRITERION(q2_vec, q1_vec, MARGIN)
        OPTIMIZER.zero_grad()
        loss.backward()
        OPTIMIZER.step()

    after = list(MODEL.parameters())[0].clone()
    print(before == after)

Comparing the before and after tensors shows they are identical after an epoch:

tensor([[True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True]])

However, after removing the weight initialization from the Siamese model, the parameters were updated, but only some rows and not others:

250/250 [00:19<00:00, 12.53it/s]

tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [False, False, False,  ...,  True, False, False],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])

250/250 [00:18<00:00, 13.54it/s]

tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])

250/250 [00:18<00:00, 13.67it/s]

tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])

250/250 [00:19<00:00, 12.67it/s]

I defined the criterion, model and optimizer as

CRITERION = HardTripletLoss(device)
MODEL = Siamese(len(vocab),128, 128, bidirectional=False).to(device)
OPTIMIZER = torch.optim.Adam(MODEL.parameters(),lr = 0.0001)

I tried different learning rates, model architectures and weight initializations (commented out below), all with the same kind of problem: only some parts are updated, sometimes even after many epochs, and sometimes nothing is updated.

import torch.nn as nn
import torch.nn.functional as F

class Siamese(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, bidirectional=False, em_weight=None):
        """
        :param vocab_size: defaultdict containing word to index
        :param embed_dim: embedding dim
        :param hidden_dim: hidden dim
        :param bidirectional: bool if True sets LSTM layer to bidirectional
        :param em_weight: embedding weight initialization see https://pytorch.org/docs/stable/nn.init.html
        """
        super(Siamese, self).__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        self.emb_out_dim = self.hidden_dim if not self.bidirectional else self.hidden_dim * 2
        self.em = nn.Sequential(
            nn.Embedding(self.vocab_size, self.embed_dim),
            nn.LSTM(self.embed_dim, self.hidden_dim, batch_first=True, bidirectional=self.bidirectional)
        )
        self.fc = nn.Sequential(
            nn.Linear(self.emb_out_dim, self.hidden_dim)
        )
#         self.init_hidden(em_weight)

    def forward(self, x1, x2):
        # n1
        lstm_out_1, _ = self.em(x1)
        mean_layer1 = lstm_out_1[:, -1, :]
#         fc1 = self.fc(lstm_out_1[:, -1, :])

        normalize1 = F.normalize(mean_layer1)
        lstm_out_2, _ = self.em(x2)
        mean_layer2 = lstm_out_2[:, -1, :] # .mean(0, keepdims=True)
        normalize2 = F.normalize(mean_layer2)
#         fc2 = self.fc(lstm_out_2[:, -1, :])
        return normalize1, normalize2

#     def init_hidden(self, em_weight):
#         for m in self.modules():
#             if isinstance(m, nn.Embedding):
#                 if em_weight is None:
#                     nn.init.normal_(m.weight)
#                 else:
#                     em_weight(m.weight)
#             elif isinstance(m, nn.Linear):
#                 nn.init.normal_(m.weight)

I have defined my custom loss function as

class HardTripletLoss(nn.Module):
    def __init__(self, device='cpu'):
        """
        Custom Hard triplet loss
        :param device: device type used
        """
        super(HardTripletLoss, self).__init__()
        self.device = device

    def forward(self, v1, v2, margin):
        scores = v1 @ v2.T
        batch_size = len(scores)
        positive = torch.diag(scores)
        negative_without_positive = scores - 2.0 * torch.eye(batch_size).to(self.device)
        closest_negative = negative_without_positive.max(axis=1)[0]
        negative_zero_on_duplicate = scores * (1.0 - torch.eye(batch_size).to(self.device))
        mean_negative = torch.sum(negative_zero_on_duplicate, 1) / (batch_size - 1)
        triplet_loss1 = torch.maximum(margin - positive + mean_negative, torch.tensor(0).to(self.device))
        triplet_loss2 = torch.maximum(margin - positive + closest_negative, torch.tensor(0).to(self.device))
        triplet_loss = torch.mean(triplet_loss2 + triplet_loss1)
        return triplet_loss

I am confused about where my mistake might be: the model or the custom loss.
I tested the custom loss, and it gave the right results on manually entered vectors.
What might I be doing wrong here? Help!

Besides the values of the parameters, try to check whether the .grad attributes of all parameters are filled with some values after the first backward() call.
Before the first call, they should be initialized with None; afterwards they should contain values.
Depending on the training, some parameters might get a zero gradient, so you wouldn’t see the update in the parameter values directly.
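
A check along these lines could look like the sketch below. The small nn.Sequential model and the random input are placeholders for illustration; in the thread’s setup you would loop over MODEL.named_parameters() right after loss.backward().

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Before the first backward() every .grad is None.
print(all(p.grad is None for p in model.parameters()))

loss = model(torch.randn(16, 4)).pow(2).mean()
loss.backward()

# After backward(): report per-parameter gradient norms to spot None or
# (near-)zero gradients.
grad_report = {}
for name, p in model.named_parameters():
    grad_report[name] = None if p.grad is None else p.grad.norm().item()
    print(name, grad_report[name])
```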

I tried print([i.grad for i in list(MODEL.parameters())]) after backward. Yes, before backward the grads were [None, None, None, None, None, None, None], and after backward they were filled in. Here is my grad:


I’m still getting the same kind of problem. I also tried the test set, and the predictions remain the same across all epochs.

When I appended the gradients to a list, ran for 10 epochs and looked at the lists, they all contained zeros, and comparing the first-epoch grad with the last-epoch grad gave True.

Having seen other posts, I also tried adding loss.register_hook(lambda grad: print(grad)) both before and after loss.backward(): added before, it printed tensor(1.); added after, it printed nothing.

Based on the output it seems Autograd is indeed able to backpropagate through the complete computation graph, but the gradients are tiny (zero or close to zero) so you would have to check the architecture and try to figure out which operation creates these small gradients.

There are two None entries in the output. Did you figure out which parameters return a None gradient and is this expected?

Since the gradients are tiny, I tried nn.utils.clip_grad_value_(MODEL.parameters(), 1), but it didn’t help. I wondered whether the model architecture was my problem, but I have been converting it from another framework, where it looks similar:

def Siamese(vocab_size=len(vocab), d_model=128, mode='train'):
    def normalize(x):  # normalizes the vectors to have L2 norm 1
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))
    q_processor = tl.Serial(  # Processor will run on Q1 and Q2.
        tl.Embedding(vocab_size, d_model), # Embedding layer
        tl.LSTM(d_model), # LSTM layer
        tl.Mean(axis=1), # Mean over columns
        tl.Fn('Normalize', lambda x: normalize(x))  # Apply normalize function
    )
    # Run on Q1 and Q2 in parallel.
    model = tl.Parallel(q_processor, q_processor)
    return model

which is the same as our PyTorch architecture:

    def forward(self, x1, x2):

        lstm_out_1, _ = self.em(x1)
        mean_layer_1 = lstm_out_1.mean(1)
        normalized_1 = F.normalize(mean_layer_1)

        lstm_out_2, _ = self.em(x2)
        mean_layer_2 = lstm_out_2.mean(1)
        normalized_2 = F.normalize(mean_layer_2)
        return normalized_1, normalized_2

As for the loss, I also tried other distance losses such as torch.norm or torch.dist to see if the gradients would be better, but got the same tiny values as before. Here is that framework’s hard triplet loss; my PyTorch loss matches it and gave the same result when testing.

def TripletLoss(v1, v2, margin=0.25):
    scores = fastnp.dot(v1, v2.T)  # pairwise cosine sim
    batch_size = len(scores)
    positive = fastnp.diagonal(scores) # the positive ones (duplicates)
    negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)
    closest_negative = negative_without_positive.max(axis=1)
    negative_zero_on_duplicate =  scores * (1.0 - fastnp.eye(batch_size))
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
    triplet_loss1 = fastnp.maximum(0.0, margin - positive + closest_negative)
    triplet_loss2 = fastnp.maximum(0.0, margin - positive + mean_negative)
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    return triplet_loss

Training with the trax framework worked and gave good results, but after converting to PyTorch I’m stuck and can’t figure out where my problem could be, because using a different architecture also didn’t solve the problem of tiny or zero gradients. Help!!

That None in my gradients was not expected; using a different architecture removed it, but all zeros came up instead.

I’m not familiar with trax and would recommend to check smaller code blocks for the right implementation including the gradients.
I.e. isolate layers and compare the output as well as the gradients for both frameworks using a constant input and the same parameters.

Since you are dealing with LSTM modules, I would start with them and make sure the dimensions are the same in both implementations.
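
As an example of such an isolated comparison, the sketch below checks one small piece in PyTorch only: a manual port of the normalize() helper from the trax snippet against F.normalize, on the same constant input, comparing both outputs and input gradients. The tensor shapes and the random projection used as a loss are assumptions for illustration; a cross-framework check would additionally need the trax outputs exported for the same input and parameters.

```python
import torch
import torch.nn.functional as F


def manual_normalize(x):
    # Mirrors normalize() from the trax snippet: x / sqrt(sum(x * x)).
    return x / torch.sqrt(torch.sum(x * x, dim=-1, keepdim=True))


torch.manual_seed(0)
x_ref = torch.randn(4, 8)    # constant input shared by both variants
target = torch.randn(4, 8)   # fixed projection so backward() has a scalar loss


def run(fn):
    """Return the output and the input gradient of fn on the shared input."""
    x = x_ref.clone().requires_grad_(True)
    out = fn(x)
    (out * target).sum().backward()
    return out.detach(), x.grad


out_a, grad_a = run(manual_normalize)
out_b, grad_b = run(lambda x: F.normalize(x, dim=-1))

print(torch.allclose(out_a, out_b, atol=1e-6))     # outputs agree
print(torch.allclose(grad_a, grad_b, atol=1e-6))   # gradients agree
```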


I was partly able to figure out one small reason why .grad was giving weird values and the parameters were not updating: I was using batch=2, which was too small. With batch=256 the parameters were updating.
The gradients are still as tiny as before, but the before/after comparison of .parameters() now returns all False over longer training runs.

Another reason was in my model itself: some tweaking considerably changed how .grad behaves (giving NaN, zero, or tiny values). But I still have one problem.

I have one question to ask:

  • How can I get a layer that computes mean values along one tensor axis?
    I applied .mean(1) to the previous output and returned that variable, but I think that doesn’t give me a mean layer. Is that relevant, or is there a layer that does this? I didn’t see one in the docs.
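
For what it is worth, calling .mean(1) on a tensor is already differentiable, so autograd treats it like any other layer. If a module object is still wanted (for example to place it inside nn.Sequential), a thin wrapper is enough. The class name Mean below is a made-up helper for illustration, not something taken from torch.nn.

```python
import torch
from torch import nn


class Mean(nn.Module):
    """Hypothetical helper: averages its input over one dimension."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        return x.mean(dim=self.dim)


lstm_out = torch.randn(2, 5, 8, requires_grad=True)   # (batch, seq, hidden)
pooled = Mean(dim=1)(lstm_out)
print(pooled.shape)   # torch.Size([2, 8])

pooled.sum().backward()
# Each of the 5 time steps receives an equal share (1/5) of the gradient.
print(lstm_out.grad[0, :, 0])
```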

