Custom layer gets the same weights in every training iteration

Hello, everyone.
I want to make a custom regularization layer with PyTorch, but something is wrong with it: its loss output is the same at every step of training.
The real problem, as far as I can tell, is that myloss sees the same net.parameters() in every training iteration, and I do not know why it gets the same weight parameters even though I pass in the updated network.

My custom layer is shown below:

import torch
import torch.nn as nn

class bipolar_loss(nn.Module):

    def __init__(self, lambd=5e-7):
        super(bipolar_loss, self).__init__()
        self.lambd = lambd

    def forward(self, net):
        loss = 0
        for param in net.parameters():
            # Only for weight parameters (4-D tensors, i.e. conv kernels)
            if len(param.size()) == 4:
                loss += (1 - torch.pow(param, 2)).sum()
        return loss * self.lambd

And I use this layer as shown below:

criterion = nn.CrossEntropyLoss().cuda()
loss_func = bipolar_loss().cuda()

In the training process, I want to add the two losses together, as with L2 regularization, so I sum them in every training iteration:

for batch_idx, (inputs, targets) in enumerate(trainloader):
    inputs, targets =,
    # forward pass to get output / logits
    outputs = net(inputs)

    # calculate loss *default is cross entropy loss
    loss = criterion(outputs, targets)
    myloss = loss_func(net)

    loss = loss + myloss

    # getting gradients parameters
    # updating parameters
    train_loss += loss.item()
    _, predicted = outputs.max(1)
    total += targets.size(0)
    correct += predicted.eq(targets).sum().item()

epoch_loss = train_loss/(batch_idx+1)
epoch_acc = 100.*correct/total
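
The loop above leaves out the gradient and update calls. As a minimal sketch of one complete step (not the poster's actual code), the following uses a toy network, a random batch, and an SGD optimizer, all of which are assumptions made for illustration; `penalty` is a hypothetical function form of the same formula as bipolar_loss. It also checks whether a conv weight changes after `optimizer.step()`, which is one way to test the claim that the parameters stay the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a tiny conv net and one random batch on the CPU
net = nn.Sequential(nn.Conv2d(1, 2, 3), nn.Flatten(), nn.Linear(2 * 6 * 6, 3))
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(4, 1, 8, 8)
targets = torch.randint(0, 3, (4,))


def penalty(model, lambd=5e-7):
    # Same formula as bipolar_loss.forward, written as a plain function
    return lambd * sum(
        (1 - p.pow(2)).sum() for p in model.parameters() if p.dim() == 4
    )


before = net[0].weight.detach().clone()

optimizer.zero_grad()                 # clear gradients from the previous step
outputs = net(inputs)
loss = criterion(outputs, targets) + penalty(net)
loss.backward()                       # gradients of the combined loss
optimizer.step()                      # update the parameters

changed = not torch.equal(before, net[0].weight.detach())
print("conv weights changed:", changed)
```

If a check like this reports that the weights do change while the printed penalty does not, the parameters are being updated and the penalty is simply moving by less than the printed precision.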

However, I get the same value every time I print myloss:

iteration 0:
loss: tensor(2.2010, device='cuda:6', grad_fn=<NllLossBackward>)
myloss: tensor(2.3100, device='cuda:6', grad_fn=<MulBackward0>)
total_loss: tensor(4.5111, device='cuda:6', grad_fn=<AddBackward0>)

iteration 1:
loss: tensor(2.2096, device='cuda:6', grad_fn=<NllLossBackward>)
myloss: tensor(2.3100, device='cuda:6', grad_fn=<MulBackward0>)
total_loss: tensor(4.5196, device='cuda:6', grad_fn=<AddBackward0>)

myloss is the same in every iteration and epoch…

Don’t use in-place operations like “+=”. They don’t always play nicely with autograd; you can read more here.

So instead of:
loss += (1 - torch.pow(param, 2)).sum()
use:
loss = loss + (1 - torch.pow(param, 2)).sum()

Try this change and see if it affects what you’re seeing.

Thanks for your kind advice!
I changed the in-place operation to loss = loss + ~ and checked the result, but it is the same (myloss does not change). The critical problem is that the layer gets the same parameters in every training iteration.
But of course, my goal is to make myloss work correctly.

Hmm, that’s interesting. It might be that PyTorch isn’t recording the operations done on it because loss inside bipolar_loss starts out as a Python scalar.

Instead of:
loss = 0
use:
loss = torch.tensor(0.0, requires_grad=True)

Let me know if that does something different.

Oh… I got an error like this:

File “”, line 134, in train
File “/home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/”, line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File “/home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/autograd/”, line 98, in backward
RuntimeError: Function AddBackward0 returned an invalid gradient at index 0 - expected type TensorOptions(dtype=float, device=cuda:6, layout=Strided, requires_grad=false) but got TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false) (validate_outputs at /opt/conda/conda-bld/pytorch_1587428094786/work/torch/csrc/autograd/engine.cpp:484)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7feb06c6eb5e in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/
frame #1: + 0x2ae2834 (0x7feb30932834 in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/
frame #2: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x548 (0x7feb30934368 in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/
frame #3: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7feb309362f2 in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/
frame #4: torch::autograd::Engine::thread_init(int) + 0x39 (0x7feb3092e969 in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/
frame #5: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7feb33c6f9f8 in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/
frame #6: + 0xc819d (0x7feb366c719d in /home/sangwooj/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib/…/…/…/…/./
frame #7: + 0x76ba (0x7feb4ecb56ba in /lib/x86_64-linux-gnu/
frame #8: clone + 0x6d (0x7feb4e9eb41d in /lib/x86_64-linux-gnu/

Okay that’s weird, give me a sec I’m going to see if I can reproduce it locally…

Okay, ignore the second suggestion I made. I think it caused that error when using CUDA, since torch.tensor(0.0) is created on the CPU while your parameters live on the GPU.
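
One way to sidestep a device mismatch of that kind is to never create a separate accumulator at all and build the penalty only from the parameters themselves. The following is a minimal sketch under that assumption; `BipolarPenalty` is a hypothetical name, and the small conv net is only there to make the example runnable.

```python
import torch
import torch.nn as nn


class BipolarPenalty(nn.Module):
    """Hypothetical variant of bipolar_loss that stays on the model's device."""

    def __init__(self, lambd=5e-7):
        super().__init__()
        self.lambd = lambd

    def forward(self, net):
        # One scalar per 4-D (conv) weight tensor; each lives on that tensor's device
        terms = [(1 - p.pow(2)).sum() for p in net.parameters() if p.dim() == 4]
        if not terms:
            # No conv weights: return a zero created on the first parameter's device
            return next(net.parameters()).new_zeros(())
        return self.lambd * torch.stack(terms).sum()


# Toy model for illustration (assumption); lambd=1.0 makes the gradient easy to read
net = nn.Sequential(nn.Conv2d(3, 4, 3), nn.ReLU(), nn.Conv2d(4, 2, 3))
penalty = BipolarPenalty(lambd=1.0)
value = penalty(net)
value.backward()

# With lambd=1.0 the gradient of sum(1 - w**2) is -2 * w for every conv weight
print(torch.allclose(net[0].weight.grad, -2 * net[0].weight.detach()))
```

Because every term is derived from a parameter, the result follows the model wherever it is moved, and no `requires_grad=True` leaf has to be created by hand.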

I think your loss function is actually working, but the resulting gradients are incredibly small. In my tests I’m seeing gradients as small as 1e-8. With a learning rate such as 1e-2, the updates to your weights are then incredibly small. So your loss appears to be unmoving, but it is actually changing. Set your lambd to some larger number for testing, such as 10.0. I think you’ll see a change in your loss then.
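
To make the scale concrete, here is the arithmetic for a single weight under the penalty lambd * (1 - w**2), whose derivative is -2 * lambd * w. The weight value 0.05 is an assumed, illustrative magnitude; lambd and the learning rate are the values mentioned in this thread, and the update is a plain SGD step driven by the penalty alone.

```python
# Worked arithmetic for one weight w under the penalty lambd * (1 - w**2)
lambd = 5e-7   # default from bipolar_loss
lr = 1e-2      # learning rate mentioned above
w = 0.05       # assumed typical weight magnitude (illustrative)

grad = -2 * lambd * w      # d/dw [lambd * (1 - w**2)]
step = -lr * grad          # plain SGD update: w_new = w + step
w_new = w + step

# Change in this weight's contribution to the penalty after one step
loss_change = lambd * (1 - w_new**2) - lambd * (1 - w**2)

print(f"grad        = {grad:.1e}")
print(f"step        = {step:.1e}")
print(f"loss change = {loss_change:.1e}")
```

Under these assumptions the gradient is on the order of 1e-8 and the per-weight change in the penalty is many orders of magnitude below the four decimal places that a printed tensor shows, so even summed over millions of weights the displayed value would not move.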

Thanks for your comment!
Does the myloss output have the same value in your code? In my training process, the myloss value stays the same.

I made a simple model that has one layer (linear, so I adjusted your if condition) and the only objective function was yours. I did see that it was unchanging at 0.005 after an optimization step until I bumped up lambd, then I saw that it was changing.

Even with bumping your lambd value up you don’t see a change in your myloss term?


You are right… If I change lambd to 0.005, then myloss changes very slightly in every iteration.
Thanks for your help. But the result is not what I wanted :sob: because the value is too large compared with the difference between before and after (e.g. iteration 1: myloss 23083.1270, iteration 2: myloss 23083.1640).
I think I have to consider why that happens. Thank you!

Keep in mind that loss values are completely arbitrary. For some loss functions, a value of 10 is small and for other loss functions, a value of 10,000 is small.

Since your loss term is a sum over every parameter value, we can expect it to grow with the number of parameters. So it’s fine to have a loss value of, say, 20,000; the important thing is that your gradients aren’t insane.

If you prefer to have a smaller value, perhaps consider taking the average per parameter group (so mean() instead of sum()). If you work out the math, you’ll find that the gradients keep the same proportions with respect to each parameter value within the same parameter group, just at a different scale. So it’s almost the same for all intents and purposes. I would suggest taking the average because a “low” value will then mean the same thing regardless of the size of the model.
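
The proportionality claim can be checked by hand for one parameter group. In this small sketch the weight values are made up for illustration; the gradient of sum(1 - w**2) with respect to each w is -2 * w, and the gradient of mean(1 - w**2) is the same vector divided by the number of elements.

```python
# Illustrative weights for a single parameter group (assumed values)
weights = [0.5, -0.2, 0.1, 0.8]
n = len(weights)

# Gradient of sum(1 - w**2) with respect to each w
grad_sum = [-2 * w for w in weights]
# Gradient of mean(1 - w**2): same direction, scaled by 1 / n
grad_mean = [-2 * w / n for w in weights]

# Every element differs by the same factor n, so the proportions are unchanged
ratios = [gs / gm for gs, gm in zip(grad_sum, grad_mean)]
print(ratios)
```

Since the factor is 1 / n per group, switching to mean() acts like a per-group rescaling of lambd; groups with different sizes are scaled differently, which is why it is almost, but not exactly, the same objective.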

Also keep in mind that when you have two or more loss objectives, they might conflict with each other! As in, reducing loss1 may result in an increase of loss2.
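
A one-dimensional toy case shows the kind of conflict meant here. The two quadratic losses below are purely illustrative assumptions (they are not the cross-entropy or bipolar terms from this thread): one pulls the weight toward +1, the other toward -1.

```python
def loss1(w):
    # Illustrative objective with its minimum at w = 1
    return (w - 1.0) ** 2


def loss2(w):
    # Illustrative objective with its minimum at w = -1
    return (w + 1.0) ** 2


w = 0.0
lr = 0.1

# Take one gradient step on loss1 only
grad1 = 2 * (w - 1.0)
w_after = w - lr * grad1

print("loss1:", loss1(w), "->", loss1(w_after))
print("loss2:", loss2(w), "->", loss2(w_after))
```

The step lowers loss1 and raises loss2, so when both terms are summed the optimizer settles on a compromise whose position depends on the relative weighting, which is the role lambd plays in the combined loss.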