Why didn't "loss.backward()" update the parameters' gradients?

ChengzheXu · June 11, 2018, 12:36pm

Hi, I ran into a problem with gradient updates when training my network. I built a CNN with “two” weights per layer, the original float weight “self.weight” and a binarized one “self.Bi_weight”, and I created them in the same way:

    if transposed:
        # If transposed, [in, out, [kernel size]]
        self.weight = Parameter(torch.Tensor(
            in_channels, out_channels // groups, *kernel_size))
        self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda()
    else:
        # If not, [out, in, [kernel size]]
        self.weight = Parameter(torch.Tensor(
            out_channels, in_channels // groups, *kernel_size))
        self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda()

and initialized them from a uniform distribution. However, after computing the gradient with “loss.backward()”, I found that “self.Bi_weight.grad” didn’t change (it was still None) and the model didn’t train.

I am still working on this and haven’t figured out why the gradient isn’t updated as usual. Any hints would be sincerely appreciated!!

Thank you!~

tom · June 11, 2018, 12:44pm

In

self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda()

the .cuda() is a computation, so self.Bi_weight ends up holding the result of that computation and not a Parameter.
Use self.Bi_weight = Parameter(torch.Tensor(self.weight.shape).cuda()) or, better yet, leave out the .cuda() and do model.cuda() at the end.
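
A small sketch of the difference that runs on CPU. As an assumption for illustration, .double() stands in for .cuda() here: both return a new tensor computed from the Parameter. The class name TwoWeights and the attribute names are hypothetical.

```python
import torch
from torch import nn


class TwoWeights(nn.Module):
    def __init__(self):
        super().__init__()
        init = torch.empty(4, 3).uniform_(-1, 1)
        # Conversion applied AFTER wrapping: the attribute holds the result
        # of a computation, not a Parameter (.double() stands in for .cuda()).
        self.outside = nn.Parameter(init.clone()).double()
        # Conversion applied BEFORE wrapping: the attribute is a Parameter.
        self.inside = nn.Parameter(init.clone().double())


m = TwoWeights()
registered = [name for name, _ in m.named_parameters()]
print(registered)  # names the module (and hence an optimizer) will see
print(type(m.outside).__name__, m.outside.is_leaf)
print(type(m.inside).__name__, m.inside.is_leaf)

(m.outside.sum() + m.inside.sum()).backward()
print(m.inside.grad is not None)
```

Only the attribute that is a real Parameter shows up in named_parameters(), so an optimizer built from model.parameters() would never see the other one.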

Best regards

Thomas

ChengzheXu · June 11, 2018, 1:03pm

Oh, thank you for your reply! I used to think “.cuda()” just stored the parameter on the GPU, and now I see you are right!

I modified my code to leave out the “.cuda()” and added “model.cuda()” before training. I think the problem mentioned above has been solved, but another one has come up. Do you have any idea what this means?

    Traceback (most recent call last):
      File "cifar_bi.py", line 370, in <module>
        main()
      File "cifar_bi.py", line 203, in main
        train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
      File "cifar_bi.py", line 267, in train
        loss.backward()
      File "/home/xcz/.local/lib/python3.5/site-packages/torch/tensor.py", line 93, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/home/xcz/.local/lib/python3.5/site-packages/torch/autograd/__init__.py", line 89, in backward
        allow_unreachable=True) # allow_unreachable flag
    RuntimeError: leaf variable has been moved into the graph interior

Anyway, thank you for answering my question!

ChengzheXu · June 11, 2018, 1:19pm

By the way, I checked the attribute “requires_grad” of weight and Bi_weight, and both are “True”. I am confused about what “leaf variable” means.

tom · June 11, 2018, 2:05pm

A leaf variable is one that you create explicitly (with constructors/factory functions, as a Parameter, etc.), so that it has no predecessors in the sense that it is not computed from anything. requires_grad being True can mean two things:

Either it is a leaf and you set requires_grad explicitly (or via making it a Parameter),
or it has been computed from things requiring gradient.

Your error could be that you assign to or modify in place a leaf variable where you should not. This is usually hard to diagnose without looking at the code.
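
To make the leaf/non-leaf distinction concrete, here is a minimal sketch with three hypothetical tensors a, b and c; it only inspects the standard is_leaf, requires_grad and grad_fn attributes.

```python
import torch

a = torch.randn(3, requires_grad=True)   # created explicitly: a leaf
b = a * 2                                # computed from a: not a leaf
c = torch.randn(3)                       # a leaf that does not require grad

for name, t in [("a", a), ("b", b), ("c", c)]:
    print(name, t.is_leaf, t.requires_grad, t.grad_fn is not None)

b.sum().backward()
print(a.grad)                            # the gradient accumulates on the leaf
```

Here b requires grad only because it was computed from a, which is the second of the two cases above.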

Best regards

Thomas

ChengzheXu · June 11, 2018, 2:36pm

Thank you!

I think you really hit the point. I reviewed my code just now and I think my mistake is this:

I create two Parameters, weight and Bi_weight, in one conv layer, and during forward propagation the data of “Bi_weight” is computed from the data of “weight”, like:

self.Bi_weight.data[channal] = Binary(self.weight.data[channal])

When I create the two Parameters, “requires_grad” is “True” for both, so during backpropagation weight’s grad is computed from Bi_weight’s, and the latter requires a gradient.

Am I right about why my code doesn’t work? I am also still wondering: if I want loss.backward() not to compute the grad of weight (to fix the bug), and then have “optimizer.step()” update weight using Bi_weight’s grad, what should I do? I have tried setting weight.requires_grad to False and updating with this code:

    optimizer.zero_grad()
    loss.backward()
    for layer in model.modules():
        if isinstance(layer, nn.Conv2d_Bi):
            # print("shot:", layer.Bi_weight.requires_grad, layer.Bi_weight.grad)
            layer.weight.grad = copy.deepcopy(layer.Bi_weight.grad)
    optimizer.step()

But it failed: the optimizer can’t update a parameter that doesn’t require grad.

Thank you again!
Chengzhe XU

tom · June 12, 2018, 5:59am

So I’m not exactly sure I understand what you are aiming at, but to me it looks unusual to have two parameters “for the same thing”.
I don’t know anything about quantization of weights and haven’t read the literature on it, but if you want to quantize the weight and keep the unquantized weight around for training, my initial approach would be not to have the quantized weight as a parameter (it doesn’t make much sense to have a gradient if you aim at discrete values), but rather to have the unquantized (“raw”) weight as a parameter, i.e. something like

self.weight_raw = Parameter(something)

in the setup and then

quantized_weight = self.weight_raw + (torch.round(self.weight_raw)-self.weight_raw).detach()

(taking torch.round as the quantization function). And then do the calculation with quantized_weight.

This way, you will do the forward with the quantized weight, but get the gradient of the unquantized weight for updates and also update that during training.
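
A minimal sketch of this trick on a three-element weight. The values, the input x and the loss are hypothetical; only the weight_raw / quantized_weight line comes from the post above.

```python
import torch
from torch import nn

weight_raw = nn.Parameter(torch.tensor([0.2, 0.7, -1.4]))
x = torch.tensor([1.0, 2.0, 3.0])

# The forward value equals round(weight_raw); the detached difference is a
# constant for autograd, so the gradient reaches weight_raw unchanged.
quantized_weight = weight_raw + (torch.round(weight_raw) - weight_raw).detach()

loss = (quantized_weight * x).sum()
loss.backward()

print(quantized_weight.detach())   # rounded values (up to float error)
print(weight_raw.grad)             # same as x: d(loss)/d(quantized_weight)
```

The forward pass sees the rounded weight, while the backward pass treats the rounding as the identity.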

Admittedly, this could be totally off because I misunderstood what you want to do.

(If it is useful, here is the credit: I learnt that trick when @hughperkins shared it in the context of sampling (gumbel softmax). I used that for shake-shake networks.)

Best regards

Thomas

hughperkins · June 12, 2018, 6:27am

@tom thank you for crediting me with citing the idea. Note for completeness that the original source of the idea, AFAIK, is Eric Jang: gumbel-softmax/Categorical VAE.ipynb at 3c8584924603869e90ca74ac20a6a03d99a91ef9 · ericjang/gumbel-softmax · GitHub

As a somewhat relevant aside, interestingly, OpenAI used the .detach() operator as an actual mathematical symbol in their paper “DiCE: The Infinitely Differentiable Monte-Carlo Estimator” (arXiv:1802.05098); see the second unlabelled equation on page 2.

tom · June 12, 2018, 8:17am

I don’t think you ever claimed it was your invention. I just know I benefitted quite a bit from you posting it here. Apparently, you can apply the trick in other parts, too. Thanks again!

ChengzheXu · June 13, 2018, 1:19am

Oh, I’m sorry Tom, it’s my fault :tired_face:. It is not my invention; in fact, I am trying to implement the “XNOR network” in PyTorch. The author of that network wants to make the network, especially its conv layers, smaller to store and faster in forward propagation.

The author’s method is to add another “weight”, a binarized one W’, to each conv layer. That is, each conv layer has two weights: the normal one W that we know from other state-of-the-art networks, and W’:

W’ = sign(W)*mean(abs(W))

During forward propagation, the author replaces

conv_output = activation_function(conv(conv_input, W))

with his new idea:

W’ = sign(W) * mean(abs(W))
conv_output = activation_function(conv(conv_input, W’))

In backward propagation, however, the gradient of W would be NaN because of non-differentiable functions like “sign()”. So the author uses the grad of W’ to update W, like:

W := W - learning_rate * W’.grad()

And that is where I am stuck. I can have “W’.grad()” computed and (it seems) the gradient is right, but I don’t know how to update W. I have tried to use:

    self.weight = Parameter(torch.Tensor(
        out_channels, in_channels // groups, *kernel_size), requires_grad=True).cuda()
    self.Bi_weight = Parameter(torch.Tensor(self.weight.shape))

    for layer in model.modules():
        if isinstance(layer, nn.Conv2d_Bi):
            layer.weight.grad = copy.deepcopy(layer.Bi_weight.grad)

and “optimizer.step()”, but it failed. I am trying something else. Do you have any advice or thoughts?

Thank you Tom~:stuck_out_tongue_closed_eyes:

tom · June 13, 2018, 6:54am

Heya,

the comment about claiming inventions was not “aimed” at anyone, I’m sorry if it came across awkward in your thread.

One quick comment:

When you have this in PyTorch (or in general), the gradient of sign will be 0 almost everywhere: the sign function has derivative 0 except at 0. We can try this:

a = torch.randn(5, requires_grad=True)
b = a.abs().mean()*(torch.sign(a))
b.retain_grad()
b.sum().backward()
print ("b.grad", b.grad)
print ("a.grad", a.grad)

will give

b.grad tensor([ 1.,  1.,  1.,  1.,  1.])
a.grad tensor([ 0.2000,  0.2000, -0.2000, -0.2000,  0.2000])

or so, which looks about right. The magnitude depends on the balance of positive and negative numbers in the random sample: with 3 positive and 2 negative it is 0.2 ((3-2)/5), with a 4-1 split it is 0.6, and with 5-0 it is 1.0.
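
The same check with a fixed, hypothetical input instead of a random one, so the 3-2 case is reproducible. Since sign contributes zero gradient, only the mean(abs(a)) factor does, which gives sign(a_i) * sum_j sign(a_j) / n for each entry.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0, -1.0, -2.0], requires_grad=True)  # 3 positive, 2 negative
b = a.abs().mean() * torch.sign(a)
b.sum().backward()

# Closed form: sign(a_i) * (sum of all signs) / n
signs = torch.sign(a.detach())
expected = signs * signs.sum() / a.numel()

print(a.grad)
print(expected)
```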

I’m not sure I understand the update rule – maybe you want sign(W) in there?

But you could do W_prime.retain_grad() as above (again, I would not make W’ a parameter) and then just do

with torch.no_grad():
    W.add_(torch.sign(W) * W_prime.grad, alpha=-lr)

or so.
Would that work as expected?
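
Put together, a rough sketch of one manual update step could look like the following. It assumes the plain rule W := W - learning_rate * W’.grad quoted earlier in the thread (without the extra sign(W) factor). The shapes, the input x and the loss are hypothetical placeholders, not the XNOR-Net setup.

```python
import torch

torch.manual_seed(0)
lr = 0.1
W = torch.randn(4, 3, requires_grad=True)   # real-valued weight (a leaf)
x = torch.randn(5, 3)                       # hypothetical input batch

# Binarized weight: a non-leaf tensor, so ask autograd to keep its gradient.
W_prime = torch.sign(W) * W.abs().mean()
W_prime.retain_grad()

loss = (x @ W_prime.t()).pow(2).mean()      # hypothetical loss
loss.backward()

W_before = W.detach().clone()
with torch.no_grad():
    # W := W - lr * W'.grad
    W.add_(W_prime.grad, alpha=-lr)
W.grad = None                               # discard autograd's own gradient of W

print((W.detach() - W_before).abs().max())
```

W_prime has to be recomputed from W in every forward pass, so the binarized weight always follows the updated real-valued one.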

Best regards

Thomas

saba · August 29, 2020, 7:31am

Hi
I am trying to add a second, different loss function to the original one, but the weights do not update any differently. I changed the second loss function, but nothing changes. Do you think there is anything wrong? I am running the code on GPU. The first loss is nn.BCELoss() and the second is L1. The result is the same as when using just BCELoss; L1 or other losses have no effect on the results.

label.fill_(real_label)  
label=label.to(device)
output = netD(fake).view(-1)

# Calculate G's loss based on this output
errG1 = criterion(output, label)


xxx=torch.histc(GaussyMask.squeeze(1).view(-1).cpu(),100, min=0, max=1, out=None)
ddGaussy=xxx/xxx.sum()

xxx1=torch.histc(fake.squeeze(1).view(-1).cpu(),100, min=0, max=1, out=None)
ddFake=xxx1/xxx1.sum()

MSECMBSS=abs(ddGaussy-ddFake).sum()

# Calculate gradients for G adding two losses

errG=errG1+MSECMBSS
errG.backward()
D_G_z2 = output.mean().item()
D_G_z22+=D_G_z2
# Update G
optimizerG.step()

ptrblck · August 31, 2020, 1:19am

Double post with answer from here.
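
One generic way to check whether an added loss term can influence the weights at all is to look at its requires_grad and grad_fn before calling backward(). The sketch below uses hypothetical stand-in tensors (fake, target), not the data from the post; a histogram-based term is compared with a plain L1 term. As far as I know torch.histc does not propagate gradients, but the printed attributes on your own setup are the thing to trust.

```python
import torch

torch.manual_seed(0)
fake = torch.rand(8, 1, 4, 4, requires_grad=True)   # stand-in for a generator output
target = torch.rand(8, 1, 4, 4)                     # stand-in for the reference mask

# Histogram-based term, in the spirit of the snippet above
hist_fake = torch.histc(fake.view(-1), bins=100, min=0, max=1)
hist_target = torch.histc(target.view(-1), bins=100, min=0, max=1)
hist_term = (hist_target / hist_target.sum() - hist_fake / hist_fake.sum()).abs().sum()

# Plain L1 term computed directly on the tensors
l1_term = (fake - target).abs().mean()

# A term that reports requires_grad=False adds nothing to the gradients.
for name, term in [("hist_term", hist_term), ("l1_term", l1_term)]:
    print(name, term.requires_grad, term.grad_fn)
```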

mathematics · October 10, 2020, 12:16pm

Hi @ptrblck
I am getting the same problem: the parameters aren’t updating with my custom loss function. At first I suspected .backward(), but I found that the problem occurs after optimizer.step(). I compared the model before and after, and it gave the same results:

MODEL.train()

for epoch in range(EPOCHS):
    # Snapshot of the first parameter before this epoch
    before = list(MODEL.parameters())[0].clone()
    for batch in tqdm(valid_loader):  # separate name: do not shadow the epoch counter
        q1 = batch[0].to(device)
        q2 = batch[1].to(device)
        q1_vec, q2_vec = MODEL(q2, q1)

        loss = CRITERION(q2_vec, q1_vec, MARGIN)

        loss.backward()
        OPTIMIZER.step()
        OPTIMIZER.zero_grad()
    after = list(MODEL.parameters())[0].clone()
    print(before == after)

Comparing the before and after tensors shows they are the same after an epoch:

tensor([[True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        ...,
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True],
        [True, True, True,  ..., True, True, True]])
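
A compact variant of this check, as a sketch: snapshot every parameter by name and report which ones changed after one optimizer step. The tiny nn.Linear model and the loss are hypothetical placeholders, not the Siamese model.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                                   # hypothetical stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Detached copies, so later in-place updates do not affect the snapshot
snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# True means at least one element of that parameter changed
changed = {n: not torch.equal(snapshot[n], p.detach())
           for n, p in model.named_parameters()}
print(changed)
```

Checking every named parameter avoids drawing conclusions from the first one only, which in the Siamese model would be the embedding matrix.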

However, after removing the weight initialization from the Siamese model, the parameters were updated, but only some rows and not others:

100%
250/250 [00:19<00:00, 12.53it/s]


tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [False, False, False,  ...,  True, False, False],
        [ True,  True,  True,  ...,  True,  True,  True],
        ...,
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])

100%
250/250 [00:18<00:00, 13.54it/s]


tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])

100%
250/250 [00:18<00:00, 13.67it/s]


tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])

100%
250/250 [00:19<00:00, 12.67it/s]

I defined the loss, model, and optimizer as

CRITERION = HardTripletLoss(device)
MODEL = Siamese(len(vocab),128, 128, bidirectional=False).to(device)
OPTIMIZER = torch.optim.Adam(MODEL.parameters(),lr = 0.0001)

I tried different learning rates, model architectures, and the weight initialization that is commented out below. All give the same kind of problem: only some parts update, even over more epochs, and sometimes nothing updates at all.

model

class Siamese(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, bidirectional=False, em_weight=None):
        """
        :param vocab_size: defaultdict containing word to index
        :param embed_dim: embedding dim
        :param hidden_dim: hidden dim
        :param bidirectional: bool if True sets LSTM layer to bidirectional
        :param em_weight: embedding weight initialization see https://pytorch.org/docs/stable/nn.init.html
        """
        super(Siamese, self).__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        self.emb_out_dim = self.hidden_dim if not self.bidirectional else self.hidden_dim * 2
        self.em = nn.Sequential(
            nn.Embedding(self.vocab_size, self.embed_dim),
            nn.LSTM(self.embed_dim, self.hidden_dim, batch_first=True, bidirectional=self.bidirectional)
        )
        self.fc = nn.Sequential(
            nn.Linear(self.emb_out_dim, self.hidden_dim)
        )
#         self.init_hidden(em_weight)

    def forward(self, x1, x2):
        # n1
        lstm_out_1, _ = self.em(x1)
        mean_layer1 = lstm_out_1[:, -1, :]
#         fc1 = self.fc(lstm_out_1[:, -1, :])

        normalize1 = F.normalize(mean_layer1)
        
        lstm_out_2, _ = self.em(x2)
        mean_layer2 = lstm_out_2[:, -1, :] # .mean(0, keepdims=True)
        normalize2 = F.normalize(mean_layer2)
#         fc2 = self.fc(lstm_out_2[:, -1, :])
        
        return normalize1, normalize2


#     def init_hidden(self, em_weight):
#         for m in self.modules():
#             if isinstance(m, nn.Embedding):
#                 if em_weight is None:
#                     nn.init.normal_(m.weight)
#                 else:
#                     em_weight(m.weight)
#             elif isinstance(m, nn.Linear):
#                 nn.init.normal_(m.weight)

I have defined my custom loss function as

loss

class HardTripletLoss(nn.Module):
    def __init__(self, device='cpu'):
        """
        Custom Hard triplet loss
        :param device: device type used
        """
        super(HardTripletLoss, self).__init__()
        self.device = device

    def forward(self, v1, v2, margin):
        scores = v1 @ v2.T
        batch_size = len(scores)
        positive = torch.diag(scores)
        negative_without_positive = scores - 2.0 * torch.eye(batch_size).to(self.device)
        closest_negative = negative_without_positive.max(axis=1)[0]
        negative_zero_on_duplicate = scores * (1.0 - torch.eye(batch_size).to(self.device))
        mean_negative = torch.sum(negative_zero_on_duplicate, 1) / (batch_size - 1)
        triplet_loss1 = torch.maximum(margin - positive + mean_negative, torch.tensor(0).to(self.device))
        triplet_loss2 = torch.maximum(margin - positive + closest_negative, torch.tensor(0).to(self.device))
        triplet_loss = torch.mean(triplet_loss2 + triplet_loss1)
        return triplet_loss

I am confused about where my mistake might be: the model or the custom loss.
I tested the custom loss, and it gave the right results when I fed it vectors manually.
What might I be doing wrong here? Help!

ptrblck · October 11, 2020, 9:18am

Besides the values of the parameters, try to check whether the .grad attributes of all parameters are filled with some values after the first backward() call.
Before the first call, they should be initialized with None, afterwards they should contain values.
Depending on the training some parameters might get a zero gradient so you wouldn’t see the update in the parameter values directly.
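
A sketch of such a check. The Tiny module and the grad_report helper are hypothetical names made up for this example. Reporting by parameter name also shows which entries stay None, for instance a layer that is never used in forward().

```python
import torch
from torch import nn


class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(2, 2)   # never called in forward()

    def forward(self, x):
        return self.used(x)


def grad_report(model):
    """Map each parameter name to None (no gradient) or its gradient norm."""
    return {
        name: None if p.grad is None else p.grad.norm().item()
        for name, p in model.named_parameters()
    }


torch.manual_seed(0)
model = Tiny()
before = grad_report(model)                  # nothing has been backpropagated yet
model(torch.randn(3, 4)).sum().backward()
after = grad_report(model)
print(before)
print(after)
```

A norm that is zero or extremely small points at a vanishing gradient, while None means the parameter never took part in the computation of the loss.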

mathematics · October 11, 2020, 11:30am

I tried print([i.grad for i in list(MODEL.parameters())]). Before backward, the grads were [None, None, None, None, None, None, None]; after backward, the grads were filled in. Here are my grads:


[tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, …, 0.0000e+00,
0.0000e+00, 0.0000e+00],
[-1.2490e-16, 8.3267e-17, 1.1102e-16, …, 1.3878e-17,
-4.5103e-17, -6.9389e-18],
[-7.2703e-22, -2.7348e-22, 1.7239e-21, …, -2.4423e-21,
-1.5073e-21, 1.3535e-21],
…,
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, …, 0.0000e+00,
0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, …, 0.0000e+00,
0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, …, 0.0000e+00,
0.0000e+00, 0.0000e+00]]), tensor([[-3.3321e-18, -2.6706e-18, 1.7975e-18, …, -3.8878e-19,
3.3207e-18, -3.5714e-18],
[-1.4112e-17, -1.6273e-17, 8.3636e-18, …, -2.3020e-18,
1.8889e-17, -1.5621e-17],
[-1.8330e-21, 4.8687e-20, 7.1716e-20, …, -2.7070e-20,
3.2151e-20, -7.3505e-20],
…,
[-2.9748e-19, -3.8436e-19, -4.1967e-20, …, 8.1230e-20,
-9.9047e-19, 1.4248e-18],
[-2.4836e-17, -3.6160e-17, 1.7344e-17, …, -5.1495e-18,
3.4364e-17, -2.9532e-17],
[-3.9375e-18, 5.7459e-18, -4.1758e-18, …, -5.3851e-19,
1.5547e-17, -1.2789e-17]]), tensor([[ 1.4108e-18, 1.5147e-18, 2.3371e-19, …, -2.7911e-19,
-2.5587e-18, 1.4292e-18],
[ 8.5342e-18, 6.7513e-18, 8.1009e-19, …, -1.9343e-18,
-9.4413e-18, 3.6359e-18],
[-9.6708e-20, -3.1651e-20, -1.7680e-20, …, 2.2608e-19,
3.7656e-19, -5.0888e-19],
…,
[-4.4646e-19, -2.1585e-19, -3.7818e-19, …, -4.5977e-19,
2.7950e-18, -2.5085e-18],
[ 1.7277e-17, 1.1787e-17, 1.8961e-18, …, -4.0978e-18,
-2.0445e-17, 8.7052e-18],
[-2.8179e-18, -6.9943e-19, -7.0650e-20, …, 3.6348e-18,
9.6777e-19, -7.7131e-18]]), tensor([ 4.6894e-18, 2.2878e-17, -3.2786e-20, 1.3627e-17, 1.0941e-17,
1.9874e-19, -1.3745e-18, 5.6263e-18, 1.0998e-17, 4.2638e-18,
-6.8564e-17, 6.6776e-18, -1.8792e-17, 1.3853e-17, -6.0755e-17,
5.2941e-18, 7.6621e-17, 2.3874e-17, -8.7876e-18, -3.4695e-18,
4.9182e-18, 6.6025e-18, 7.8126e-20, 5.1197e-18, 1.7992e-18,
-2.1652e-17, -5.2732e-20, 6.5220e-18, 5.1715e-18, 3.0783e-18,
5.0582e-17, 1.7135e-17, 3.7236e-19, 1.2215e-17, -1.3027e-17,
1.4486e-18, 3.0342e-18, 1.0553e-17, -2.1137e-18, -4.0099e-17,
2.8459e-18, 4.1955e-21, 1.6731e-17, -1.7585e-17, -4.1442e-17,
1.2786e-17, 1.3793e-18, 9.0336e-18, -1.9907e-18, 1.3201e-18,
-1.1858e-17, 9.4918e-18, -6.0241e-18, -4.7354e-18, 1.9112e-18,
4.6455e-17, 4.3406e-18, -4.6537e-18, -8.3033e-18, -9.4413e-19,
-5.5989e-17, 4.1224e-18, 3.7873e-18, 8.4627e-18, 3.3483e-17,
-1.0763e-18, 1.0770e-17, 1.5640e-17, 3.5447e-18, -1.3743e-18,
-1.1180e-18, -8.8832e-18, -2.9878e-17, -1.6214e-17, -2.7716e-18,
1.4856e-17, 3.6935e-18, 2.9767e-18, 2.0487e-19, 1.9427e-17,
1.3582e-17, 1.9970e-19, 9.9781e-18, 2.2430e-19, 7.3522e-19,
-2.7334e-17, -5.3880e-18, 2.6055e-17, 5.5365e-19, 2.6485e-17,
3.6412e-18, -1.8318e-17, 2.1838e-17, -4.2521e-18, 9.6854e-19,
2.8140e-17, 7.2106e-18, -5.1136e-18, -6.2155e-17, 1.6230e-19,
-2.4334e-18, -3.5710e-18, 1.6169e-19, 1.0472e-18, -3.4524e-17,
-3.9877e-16, -1.4390e-18, -1.2469e-17, -1.3241e-18, -1.8618e-19,
2.4753e-18, 2.2206e-17, -3.4054e-18, 9.8636e-18, 1.3726e-18,
-1.0450e-17, 1.1512e-17, -3.5868e-18, -3.8310e-18, 9.3735e-18,
7.7030e-18, 8.9256e-18, 4.6936e-17, -7.8218e-18, -4.3646e-17,
-1.1260e-17, 2.2500e-18, -2.4667e-17, 6.2282e-18, 2.4533e-17,
3.5558e-18, 5.0018e-17, 1.1691e-17, -1.1232e-18, -2.9300e-18,
1.0509e-18, 2.2208e-17, 3.1611e-18, -7.5752e-17, 9.1736e-18,
7.6924e-17, 2.0052e-17, -3.8796e-17, -7.9214e-18, 1.0008e-16,
1.1697e-17, -3.7738e-18, 2.9461e-18, 2.6797e-18, 3.4428e-18,
-5.5798e-18, 2.9515e-18, 8.9296e-18, -2.9892e-17, -1.8515e-18,
4.7701e-18, 1.9752e-17, 9.9867e-18, 1.1276e-16, 1.1546e-17,
2.9342e-19, 3.0412e-17, -8.4625e-18, -2.9331e-19, 1.8331e-18,
1.5090e-17, -2.1471e-18, -2.3602e-17, -2.1058e-18, 5.9235e-18,
2.3027e-18, -5.0682e-18, -2.9411e-17, 5.1302e-18, 8.6591e-19,
1.8884e-18, -1.4682e-18, 4.9897e-19, -3.1757e-18, 9.1659e-18,
-3.9866e-18, 4.4502e-18, -4.2402e-18, 4.7148e-17, 5.7015e-18,
6.8482e-19, -5.0601e-18, 2.3989e-19, -2.8327e-17, 2.4965e-18,
-3.0540e-18, 2.5545e-18, 3.2476e-17, -7.8626e-19, 8.8460e-18,
-8.2426e-18, 1.6532e-17, 3.3342e-18, -1.2113e-18, -7.9274e-18,
-7.7510e-18, 6.7812e-18, 6.0984e-19, 7.9837e-18, 8.5446e-18,
3.7967e-18, 2.3370e-19, 1.7041e-17, -7.6878e-18, 3.0702e-19,
1.3763e-17, -2.3991e-19, 8.5431e-19, -1.0043e-17, -6.4686e-18,
8.5168e-17, 1.1107e-18, 9.6135e-18, 4.7863e-18, -1.2070e-17,
1.5583e-17, -6.2305e-18, 1.6577e-19, 2.4602e-17, -6.9084e-19,
-1.3175e-17, -4.0134e-17, 1.3637e-19, -3.7382e-18, -1.1479e-17,
8.5472e-19, 7.5166e-19, -2.5990e-17, -2.1655e-16, 2.0100e-18,
-2.1692e-17, 6.9837e-19, -1.2903e-18, 1.6622e-18, 2.1073e-17,
-4.8277e-18, 9.5853e-18, 4.6053e-18, -1.4567e-17, -4.1455e-17,
-9.6192e-18, -2.5817e-18, 3.7956e-18, 9.0075e-18, 4.5402e-19,
3.8167e-17, -8.4780e-18, -2.6936e-17, 1.0899e-17, 1.3199e-17,
-3.2709e-17, 2.3725e-17, 6.7443e-17, 1.4727e-16, -5.1448e-17,
-1.8603e-17, -5.3597e-18, 6.7770e-17, 1.8474e-17, -5.7802e-17,
3.2249e-17, 2.3948e-17, 5.5822e-19, 2.1072e-17, -5.4675e-17,
2.4305e-17, -2.9202e-17, 2.1054e-16, -1.5659e-19, 5.5076e-17,
3.9664e-17, -1.3162e-18, -1.0806e-16, 2.4289e-17, 2.4237e-17,
-7.1480e-18, 1.2465e-16, -5.8018e-18, 9.5258e-17, 1.4498e-16,
2.4598e-17, 1.1915e-17, -6.1168e-17, -1.1223e-17, -2.7088e-16,
-2.8248e-17, 6.1303e-18, 1.1962e-17, 1.3037e-17, 4.8261e-18,
1.8888e-16, -3.4205e-17, -4.7457e-17, -1.4244e-17, 1.4729e-16,
1.3887e-16, -2.6589e-17, -3.8434e-18, 2.2027e-17, -3.1586e-17,
6.2399e-17, -1.3481e-17, -1.9470e-17, 6.0803e-18, 9.7020e-17,
-2.7981e-18, 9.3520e-17, 8.7139e-18, -1.2578e-16, 7.6779e-18,
-2.3042e-18, -1.6426e-16, 4.4623e-17, -1.4743e-17, 2.6710e-17,
-1.9759e-17, -5.7887e-18, 2.6003e-17, 4.8592e-17, -2.1962e-17,
5.9888e-19, -2.2442e-17, -8.9339e-18, -9.2629e-17, -8.0043e-17,
1.0755e-19, 1.3169e-17, 1.2347e-17, 2.0018e-17, -6.8180e-18,
6.2976e-17, 1.7267e-17, -3.2384e-18, 7.4895e-17, -5.7886e-17,
5.7051e-18, -1.3402e-17, -2.6480e-17, -2.7553e-17, -2.1836e-17,
-6.5037e-17, -4.3791e-17, -4.7984e-17, -3.0518e-17, -6.1433e-17,
-2.1685e-17, -3.8707e-17, 1.0604e-17, 9.1305e-17, -2.1928e-16,
1.7237e-18, -2.9771e-17, -8.4547e-17, -4.4545e-19, -7.5876e-17,
2.2069e-16, -1.4875e-16, -5.9945e-18, 5.5648e-17, 4.5859e-18,
1.1488e-17, -3.4286e-17, -1.1882e-16, 3.0529e-17, -4.3211e-17,
-5.4673e-17, -3.1828e-17, 7.6727e-17, -2.9937e-17, 1.2139e-17,
6.6972e-18, 1.7996e-17, -8.7738e-18, 1.6183e-16, 4.1121e-17,
1.0478e-16, -1.8862e-16, -9.0038e-18, 3.8925e-16, 1.1418e-17,
2.2525e-17, 3.4782e-19, 2.9962e-17, 3.8262e-18, -6.3649e-19,
-2.5475e-18, 1.0266e-18, 1.6822e-17, 6.6444e-18, -5.6500e-17,
1.2025e-17, 1.3575e-17, 3.4246e-17, -6.1134e-17, -1.8998e-17,
1.4285e-16, 2.8016e-17, -2.1111e-18, 4.8607e-18, 4.5100e-18,
6.8632e-18, -7.3171e-18, 2.2022e-18, 5.9117e-18, -6.9444e-19,
-9.0897e-19, 6.0855e-18, -1.0563e-17, 4.7794e-18, -5.0497e-17,
4.6059e-18, -4.2439e-18, 2.0118e-17, -6.8387e-18, -1.2477e-19,
4.1060e-18, 1.2945e-17, 1.5230e-17, -3.1032e-17, 1.0578e-17,
-3.5399e-19, 9.5589e-18, 1.3820e-17, -2.7985e-17, 6.9498e-18,
3.5880e-18, -7.4323e-18, -2.6509e-18, 6.6369e-19, -1.1158e-18,
1.1048e-17, -4.7172e-18, 5.7394e-18, -4.6448e-18, 2.2590e-17,
1.6515e-18, -2.9269e-18, -9.6607e-18, -1.5475e-18, -4.6020e-18,
1.3267e-17, 7.2152e-18, 6.7048e-18, 2.2111e-17, -9.1188e-19,
8.4626e-18, -5.2299e-18, 2.0959e-18, -7.6183e-21, -1.2962e-18,
-3.1053e-17, -9.9650e-19, 4.7046e-18, 2.3180e-18, 2.5064e-17,
1.1626e-17, 8.5192e-18, -8.0132e-19, -7.5546e-18, 1.0793e-17,
3.3403e-19, 8.5442e-18, 4.5244e-19, 2.4044e-18, -4.7261e-18,
-7.0316e-18, 8.5242e-17, 2.9087e-19, 1.0278e-17, 7.8074e-18,
-6.0791e-18, 1.0209e-17, -4.3651e-18, 1.3574e-18, 1.0522e-17,
-1.3389e-18, -7.4163e-18, -5.6165e-17, 4.8978e-19, -1.1010e-18,
-9.9971e-18, -1.0530e-17, 8.8581e-19, -1.6113e-16, 1.5054e-17,
-9.5008e-18, -9.4622e-18, -9.9542e-19, -2.4850e-19, 9.1072e-19,
1.4648e-17, -1.5572e-17, 1.0750e-17, 1.3099e-18, 6.9801e-18,
-3.3712e-18, -9.8289e-18, -2.3822e-18, -1.4862e-17, 1.2849e-17,
1.4828e-17, 3.5821e-17, -3.4962e-18, -1.3818e-16, 5.3604e-19,
4.2359e-17, 3.4459e-18]), tensor([ 4.9362e-18, 2.2936e-17, -2.2510e-21, 1.3954e-17, 1.1016e-17,
1.7034e-19, -1.4443e-18, 5.5511e-18, 1.1185e-17, 4.2643e-18,
-6.3252e-17, 6.1765e-18, -1.7511e-17, 1.3809e-17, -6.2377e-17,
6.3236e-18, 7.6685e-17, 2.3932e-17, -8.1387e-18, -4.2070e-18,
5.2576e-18, 6.6007e-18, -5.4815e-19, 4.7976e-18, 3.4002e-18,
-2.1222e-17, -4.6640e-19, 6.2453e-18, 5.2609e-18, 5.5965e-18,
4.2895e-17, 1.6387e-17, 3.0782e-19, 1.2242e-17, -1.2567e-17,
1.2059e-18, 3.3827e-18, 1.1002e-17, -2.1824e-18, -3.9807e-17,
3.0122e-18, 1.8311e-18, 1.8323e-17, -1.6295e-17, -4.2135e-17,
1.2221e-17, 2.2075e-18, 9.4183e-18, -1.9139e-18, 1.2492e-18,
-1.1885e-17, 9.6804e-18, -6.1074e-18, -2.7535e-18, 3.5420e-19,
4.7430e-17, 7.1862e-18, -5.2519e-18, -7.9737e-18, -6.0169e-19,
-5.6502e-17, 4.4758e-18, 4.2638e-18, 6.5799e-18, 3.5506e-17,
-1.1032e-18, 8.9974e-18, 9.5278e-18, 3.7240e-18, -1.4219e-18,
-1.3560e-18, -6.2422e-18, -3.6134e-17, -9.6425e-18, -2.2241e-18,
1.4405e-17, 3.7618e-18, 2.8561e-18, -4.1766e-19, 1.7731e-17,
6.7696e-18, 2.4227e-19, 1.0080e-17, 2.8824e-19, 8.9742e-19,
-2.8764e-17, -5.5244e-18, 2.0831e-17, 4.8464e-19, 3.1663e-17,
4.3477e-18, -1.8152e-17, 2.3708e-17, -4.0415e-18, 9.3695e-19,
3.0694e-17, 6.7760e-18, -7.5348e-18, -5.9472e-17, 1.6476e-19,
-2.5047e-18, -3.3360e-18, 2.4882e-19, 1.0710e-18, -4.6206e-17,
-4.0711e-16, -6.5514e-19, -1.1404e-17, -1.1090e-18, -4.1890e-19,
2.4172e-18, 2.2300e-17, -3.6351e-18, 1.0125e-17, 1.4071e-18,
-1.0039e-17, 1.1518e-17, -4.2658e-18, -3.6611e-18, 4.5222e-18,
8.0660e-18, 8.7378e-18, 4.8309e-17, -8.7634e-18, -4.1615e-17,
-1.1161e-17, 2.2106e-18, -2.2021e-17, 6.2329e-18, 2.5545e-17,
3.8307e-18, 5.4364e-17, 1.1453e-17, -1.0200e-18, -2.9809e-18,
1.2575e-18, 2.0679e-17, 3.1357e-18, -7.8127e-17, 1.1643e-17,
7.6203e-17, 1.8782e-17, -4.3048e-17, -6.8940e-18, 9.9898e-17,
1.5218e-17, -3.6908e-18, 4.2516e-18, 3.5237e-18, 3.8335e-18,
-4.9942e-18, 2.9736e-18, 8.2311e-18, -3.3575e-17, -1.9176e-18,
4.8358e-18, 1.0951e-17, 8.7772e-18, 1.1003e-16, 1.0447e-17,
7.8612e-19, 3.0270e-17, -8.6018e-18, -3.6977e-19, 1.1465e-18,
1.2588e-17, -1.7002e-18, -3.3181e-17, 1.3517e-18, 3.6812e-18,
4.2816e-18, -3.9753e-18, -2.8377e-17, 5.0931e-18, 5.4417e-19,
2.4738e-18, -1.4718e-18, 6.1403e-19, -2.9697e-18, 8.6798e-18,
-3.8681e-18, 4.6213e-18, -3.5057e-18, 5.0246e-17, 7.9875e-18,
7.2052e-19, -5.1990e-18, -1.2102e-20, -2.8361e-17, 2.4296e-18,
-3.5839e-18, 3.6954e-18, 3.4344e-17, -9.2748e-19, 8.6256e-18,
-7.4636e-18, 1.6192e-17, 2.1358e-18, -1.2398e-18, -9.0602e-18,
-9.0978e-18, 1.0492e-17, 9.8555e-19, 8.3060e-18, 8.1893e-18,
3.7783e-18, 3.3466e-19, 2.5045e-17, -5.8201e-18, 4.6065e-19,
1.3261e-17, -1.9736e-19, 7.9038e-19, -7.0374e-18, -6.3854e-18,
9.0467e-17, 1.0669e-18, 8.4521e-18, 4.4519e-18, -1.2971e-17,
1.5421e-17, -6.1741e-18, 1.7955e-19, 2.3931e-17, -2.9355e-19,
-1.2239e-17, -4.5637e-17, 1.5181e-19, -3.7151e-18, -1.0910e-17,
7.9556e-19, 8.0653e-19, -3.8689e-17, -2.1239e-16, 2.0147e-18,
-2.2982e-17, 3.5345e-19, -1.2084e-18, 1.6065e-18, 1.9395e-17,
-4.2445e-18, 9.2921e-18, 4.6022e-18, -1.6440e-17, -3.5215e-17,
-9.0457e-18, -2.8167e-18, 7.6013e-18, 8.9288e-18, 2.5929e-18,
3.7999e-17, -6.7984e-18, -2.3559e-17, 1.0460e-17, 1.2774e-17,
-5.2062e-17, 3.4900e-17, 6.8593e-17, 1.4305e-16, -3.8362e-17,
-1.8674e-17, -6.9610e-18, 6.2260e-17, 1.9923e-17, -5.6744e-17,
3.3335e-17, 1.6514e-17, -6.6213e-18, 2.0843e-17, -5.1268e-17,
3.8805e-17, -3.5528e-17, 2.0950e-16, -8.0654e-19, 5.2746e-17,
4.7427e-17, -1.8672e-18, -9.8169e-17, 2.6529e-17, 1.8199e-17,
-2.2983e-17, 1.2325e-16, -7.3968e-18, 9.2331e-17, 1.6311e-16,
4.4656e-17, 1.1709e-16, -5.8373e-17, -1.1222e-17, -2.5161e-16,
-2.9289e-17, 1.3616e-18, 7.0258e-18, 1.5283e-17, 5.7801e-18,
1.8009e-16, -3.2422e-17, -4.7347e-17, -1.0852e-17, 1.3971e-16,
1.3902e-16, -2.3356e-17, -4.4717e-18, 3.0794e-17, -3.1411e-17,
6.5792e-17, -1.4599e-17, -1.7412e-17, 6.8107e-18, 9.0062e-17,
-6.1052e-18, 6.8808e-17, 8.5833e-18, -1.0921e-16, 7.6710e-18,
-6.4827e-18, -1.8865e-16, 5.3078e-17, -9.5499e-18, 2.6680e-17,
-1.9183e-17, -4.3782e-18, 1.8593e-17, 4.0812e-17, -2.9184e-17,
-2.1878e-19, -2.2112e-17, -8.8170e-18, -7.9106e-17, -7.0855e-17,
-5.3535e-19, 1.3900e-17, 1.1227e-17, 1.9197e-17, -6.3653e-18,
6.5085e-17, -3.4752e-17, 5.0891e-18, 8.0362e-17, -6.0258e-17,
5.7050e-18, -8.9185e-18, -2.5582e-17, 3.4527e-18, -2.9613e-17,
-4.0404e-17, -4.3652e-17, -4.7739e-17, -2.9517e-17, -6.3456e-17,
-1.9390e-17, -3.7127e-17, 1.0940e-17, 8.9918e-17, -2.1477e-16,
1.6428e-18, -3.2010e-17, -8.2931e-17, -3.6578e-19, -7.1128e-17,
2.2080e-16, -1.4557e-16, -4.6207e-18, 5.3004e-17, 3.0838e-18,
8.0135e-18, -3.5249e-17, -1.2017e-16, 2.8130e-17, -5.1097e-17,
-5.8329e-17, -3.0885e-17, 6.6784e-17, -2.6974e-17, 1.0205e-17,
-5.9526e-18, 1.6670e-17, -1.8663e-18, 1.6732e-16, 3.9934e-17,
6.2854e-17, -1.7888e-16, -9.1072e-18, 3.7748e-16, 1.1717e-17,
2.1023e-17, 1.3695e-19, 3.1324e-17, 3.3726e-18, -3.8983e-19,
-2.1267e-18, 1.1544e-18, 1.6370e-17, 6.5965e-18, -5.2461e-17,
7.3907e-18, 1.7607e-17, 2.9305e-17, -5.4955e-17, -1.2667e-17,
1.4040e-16, 3.0134e-17, -2.2085e-18, 3.0810e-18, 2.5141e-18,
6.9348e-18, -8.5589e-18, 2.7082e-18, 5.3448e-18, -8.7780e-18,
-5.8876e-19, 5.8723e-18, -1.5790e-17, 4.5447e-18, -4.7590e-17,
4.3929e-18, -3.8425e-18, 2.0022e-17, -7.4087e-18, 5.1673e-20,
3.0409e-18, 1.1157e-17, 1.1806e-17, -3.0585e-17, 1.0497e-17,
-1.4629e-18, 8.9381e-18, 1.3862e-17, -3.1782e-17, 8.5061e-18,
1.9678e-18, -7.4477e-18, -3.0301e-18, 7.8834e-19, -8.2740e-19,
1.0139e-17, -4.1715e-18, 7.1402e-18, 8.8143e-21, 2.4060e-17,
6.2953e-18, -2.0181e-18, -1.0133e-17, -1.3212e-18, -9.1415e-18,
1.2521e-17, -1.4216e-18, 1.3064e-17, 2.2160e-17, -9.7139e-19,
7.9220e-18, -1.4878e-18, 9.9472e-18, -2.1372e-18, -1.3081e-18,
-2.6036e-17, 4.2590e-18, 1.1422e-17, 1.8130e-18, 2.2757e-17,
7.2580e-18, 8.6838e-18, -5.3895e-19, -2.2456e-18, 1.6428e-17,
3.3106e-19, 8.3411e-18, 3.9999e-19, 2.3099e-18, -1.5327e-17,
-9.4874e-18, 8.1655e-17, 2.3883e-19, 9.6433e-18, 7.8936e-18,
-5.3725e-18, 1.1550e-17, -4.3309e-18, 1.3164e-18, 1.2375e-17,
-1.6544e-18, -7.6797e-18, -5.4422e-17, 4.8943e-19, -1.0145e-18,
-9.8863e-18, -6.8381e-18, 9.0312e-19, -1.6090e-16, 1.7354e-17,
-5.8901e-18, -1.2204e-17, -1.2899e-18, -6.1576e-19, 7.1545e-19,
1.2607e-17, -1.4044e-17, 1.1446e-17, 1.2545e-18, 6.9880e-20,
-1.5528e-18, -8.9723e-18, -5.0329e-18, -1.1036e-17, 1.2906e-17,
1.5771e-17, 3.4780e-17, -6.1600e-18, -1.4020e-16, -1.3937e-19,
4.7767e-17, 3.9442e-18]), None, None]

I’m still getting the same kind of problem. I also tried on the test set, and the predictions stay the same across all epochs.

When I appended the grads to a list, ran for 10 epochs, and looked at the lists, they all contained zeros, and comparing the first epoch’s grad with the last epoch’s gave True.

Following other posts, I also tried adding loss.register_hook(lambda grad: print(grad)) both before and after loss.backward(): registered before, it printed tensor(1.); registered after, it printed nothing.

ptrblck · October 12, 2020, 12:27am

Based on the output it seems Autograd is indeed able to backpropagate through the complete computation graph, but the gradients are tiny (zero or close to zero) so you would have to check the architecture and try to figure out which operation creates these small gradients.

There are two None entries in the output. Did you figure out which parameters return a None gradient and is this expected?
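One way to find out which parameters have a None gradient is to iterate over named_parameters() after backward. The sketch below uses a hypothetical `Toy` module with a deliberately unused layer; `grad_report` is an illustrative helper, not a PyTorch API:

```python
import torch
import torch.nn as nn

def grad_report(model):
    """Map each parameter name to its gradient norm, or None if no gradient was computed."""
    report = {}
    for name, p in model.named_parameters():
        report[name] = None if p.grad is None else p.grad.norm().item()
    return report

class Toy(nn.Module):
    """Hypothetical model: `unused` never takes part in forward()."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)

torch.manual_seed(0)
model = Toy()
model(torch.randn(3, 4)).sum().backward()

report = grad_report(model)
for name, norm in report.items():
    print(name, norm)
```

Parameters that report None were never reached by the backward pass, which usually means they are not used in forward() or were detached from the graph.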

mathematics · October 12, 2020, 2:59am

Since the gradients are tiny, I tried nn.utils.clip_grad_value_(MODEL.parameters(), 1), but it didn't help. I'm wondering whether the model architecture is my problem, but I've been converting it from another framework, where it looks similar:

def Siamese(vocab_size=len(vocab), d_model=128, mode='train'):
    def normalize(x):  # normalizes the vectors to have unit L2 norm
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))
    
    q_processor = tl.Serial(  # Processor will run on Q1 and Q2.
        tl.Embedding(vocab_size, d_model), # Embedding layer
        tl.LSTM(d_model), # LSTM layer
        tl.Mean(axis=1), # Mean over columns
        tl.Fn('Normalize', lambda x: normalize(x))  # Apply normalize function
    ) 
    # Run on Q1 and Q2 in parallel.
    model = tl.Parallel(q_processor, q_processor)
    return model

which is the same as our PyTorch architecture:

    def forward(self, x1, x2):

        lstm_out_1, _ = self.em(x1)
        mean_layer_1 = lstm_out_1.mean(1)
        normalized_1 = F.normalize(mean_layer_1)

        lstm_out_2, _ = self.em(x2)
        mean_layer_2 = lstm_out_2.mean(1)
        normalized_2 = F.normalize(mean_layer_2)
        
        return normalized_1, normalized_2

As for the loss, I also tried other distance-based losses such as torch.norm or torch.dist to see whether the gradients would look reasonable, but they gave the same tiny values as before. Here is that framework's hard triplet loss; I believe my PyTorch loss matches it, and it gave the same result when testing.


def TripletLoss(v1, v2, margin=0.25):
    scores = fastnp.dot(v1, v2.T)  # pairwise cosine sim
    batch_size = len(scores)
    positive = fastnp.diagonal(scores) # the positive ones (duplicates)
    negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)
    closest_negative = negative_without_positive.max(axis=1)
    negative_zero_on_duplicate =  scores * (1.0 - fastnp.eye(batch_size))
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
    triplet_loss1 = fastnp.maximum(0.0, margin - positive + closest_negative)
    triplet_loss2 = fastnp.maximum(0.0, margin - positive + mean_negative)
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    return triplet_loss
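For comparison, here is a sketch of how the trax loss above could be written in PyTorch. It is a direct line-by-line port and not the poster's actual PyTorch loss, which is not shown in the thread; it assumes v1 and v2 are L2-normalized with shape (batch, d_model) and that the batch size is at least 2.

```python
import torch

def triplet_loss(v1, v2, margin=0.25):
    """Port of the trax TripletLoss above; assumes normalized inputs and batch >= 2."""
    scores = v1 @ v2.T                       # pairwise cosine similarity
    batch_size = scores.shape[0]
    eye = torch.eye(batch_size, dtype=scores.dtype, device=scores.device)
    positive = torch.diagonal(scores)        # similarity of the true pairs
    # Subtract 2 on the diagonal so a positive can never be the closest negative.
    closest_negative = (scores - 2.0 * eye).max(dim=1).values
    # Zero out the diagonal and average over the remaining batch_size - 1 entries.
    mean_negative = (scores * (1.0 - eye)).sum(dim=1) / (batch_size - 1)
    loss1 = torch.clamp(margin - positive + closest_negative, min=0.0)
    loss2 = torch.clamp(margin - positive + mean_negative, min=0.0)
    return (loss1 + loss2).mean()

# Tiny check with hand-made unit vectors.
v = torch.eye(2)
swapped = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
print(triplet_loss(v, v))        # pairs already separated by the margin
print(triplet_loss(v, swapped))  # pairs mismatched, so both hinge terms are active
```

Note that whenever both hinge terms are clamped to zero for every row, the loss is exactly zero and so are all gradients, which is worth ruling out when debugging zero gradients.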

Training with the trax framework worked and gave good results, but after converting to PyTorch I'm stuck and can't figure out where my problem could be, because using a different architecture also didn't solve the problem of the gradients being tiny or zero. Help!!

That None was not expected in my gradients; using a different architecture removed it, but all zeros came back instead.

ptrblck · October 12, 2020, 5:19am

I’m not familiar with trax and would recommend checking smaller code blocks for the right implementation, including the gradients.
I.e. isolate layers and compare the output as well as the gradients for both frameworks using a constant input and the same parameters.

Since you are dealing with LSTM modules, I would start with them and make sure the dimensions are the same in both implementations.
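A minimal sketch of that kind of isolated check on the PyTorch side: run a single nn.LSTM on a constant input with a fixed seed and inspect the output shape and per-parameter gradient norms. The sizes here (batch 2, sequence length 5, feature size 8) are arbitrary assumptions, not values from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Hypothetical sizes; batch_first=True means input is (batch, seq, features).
lstm = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)
x = torch.ones(2, 5, 8)  # constant input

out, (h, c) = lstm(x)
print(out.shape)  # (batch, seq, hidden)

# Mean over the sequence axis, then reduce to a scalar to backpropagate.
out.mean(dim=1).sum().backward()

norms = {name: p.grad.norm().item() for name, p in lstm.named_parameters()}
for name, value in norms.items():
    print(name, value)
```

Repeating the same experiment in the other framework with identical weights and input makes it possible to compare outputs and gradients layer by layer.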

mathematics · October 13, 2020, 6:14pm

Hi,

I was half able to figure out one small reason why .grad was giving weird gradients that did not update. I was using batch=2, which was too small; with batch=256 the parameters were updating.
The gradients are not as tiny as before, but comparing .parameters() returned all False for longer epochs.

Another reason was my model itself, which was quite understandable, … some tweaking changed quite a lot how .grad behaved (giving nan, zero, or tiny values), but I still have one problem.

I have one question to ask:

How can I get a layer that computes the mean along one tensor axis?
I applied .mean(1) to the previous output and returned that variable, but I suspect it doesn't act as a mean layer. Is that relevant, or is there a layer for this? I didn't see one in the docs.

Thank you for your response.
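On the mean-layer question: calling .mean(1) inside forward() is already a differentiable operation tracked by autograd, so no dedicated layer is needed for gradients to flow. If a module is wanted (for example inside nn.Sequential), a thin wrapper can be written by hand. The `Mean` class below is a hypothetical helper, not something shipped with PyTorch:

```python
import torch
import torch.nn as nn

class Mean(nn.Module):
    """Hypothetical wrapper that averages its input along one dimension."""
    def __init__(self, dim, keepdim=False):
        super().__init__()
        self.dim = dim
        self.keepdim = keepdim

    def forward(self, x):
        return x.mean(dim=self.dim, keepdim=self.keepdim)

# (batch=2, seq=3, features=4): average over the sequence axis.
x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4).clone().requires_grad_()
pooled = Mean(dim=1)(x)
pooled.sum().backward()

print(pooled.shape)  # the sequence axis is reduced away
print(x.grad)        # each element contributes 1/3 to its mean
```

The wrapper has no parameters of its own; it behaves the same as writing x.mean(1) directly in forward().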