Model converges in Keras and not in PyTorch -> dying ReLU

I am trying to implement a CNN for regression on images in PyTorch. I have a working model already implemented in Keras and I would like to translate it to PyTorch, but I am facing many issues.

Essentially, in Keras the model converges, whereas in PyTorch it doesn’t.

In PyTorch I always obtain a constant training loss, and the trained model outputs the same value for every image. This happens regardless of learning rate and batch size, and apparently also for slightly different initializations: it always converges to the same loss, 0.0745.

I initialized all the layers as in the working Keras model, added L2 regularization as in Keras, and implemented the same learning rate decay. Everything looks exactly the same, but in PyTorch my model doesn't converge.

After some debugging I found out that the last ReLU of my trained model is dead. This happens only to the last one; all of the ones before it seem to be working. How would you suggest solving it? I was thinking of trying a leaky ReLU, initializing the biases of the conv layers to 0.1, or directly removing the last ReLU.
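E.g., something along these lines is what I had in mind (the names below are just placeholders, not my actual model):

import torch.nn as nn

# Option 1: replace the last ReLU with a leaky ReLU so negative inputs
# still receive a small gradient instead of dying
last_act = nn.LeakyReLU(negative_slope=0.01)

# Option 2: start the conv biases slightly positive so the ReLUs begin active
def init_conv_biases(model, value=0.1):
    for m in model.modules():
        if isinstance(m, nn.Conv2d) and m.bias is not None:
            nn.init.constant_(m.bias, value)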

Does it include using the same weight initialization scheme?

Where the Keras model specified He normal kernel initialization, I initialized the corresponding kernels with Kaiming normal in PyTorch. Otherwise I used Xavier initialization in PyTorch to emulate the default Keras initialization (Glorot uniform).

I also initialized all the biases to zero because, if I'm not mistaken, that is the Keras default.
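Concretely, I apply the initialization roughly like this (the helper name is mine; it assumes plain Conv2d/Linear layers):

import torch.nn as nn

def init_weights_like_keras(model, he_normal=False):
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            if he_normal:
                # corresponds to Keras he_normal
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            else:
                # corresponds to the Keras default glorot_uniform
                nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)  # Keras default: zeros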

Hm, what loss function do you use? Can you share the code snippet/line that implements and computes the loss?

def compute_l2_reg(model, model_name):
    # Apply weight decay only to the weights (not the biases), and only
    # to the conv layers inside the residual blocks.
    lambda_ = FLAGS.weight_decay
    params_dict = dict(model.named_parameters())
    l2_reg = []
    if model_name == 'resnet8':
        for key, value in params_dict.items():
            if (key[-8:] == '2.weight' or key[-8:] == '5.weight') and key[0:8] == 'residual':
                l2_reg += [lambda_ * torch.norm(value.view(value.size(0), -1), 2)]
    else:
        for key, value in params_dict.items():
            if (key[-8:] == '2.weight' or key[-8:] == '6.weight') and key[0:8] == 'residual':
                l2_reg += [lambda_ * torch.norm(value.view(value.size(0), -1), 2)]
    l2_reg = sum(l2_reg)
    return l2_reg

This is how I compute L2 regularization to emulate the Keras kernel regularization. Then I add l2_reg to the MSE loss (it's a regression problem):

loss = F.mse_loss(outputs, targets) + l2_reg
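
For context, the training step looks roughly like this (variable names simplified):

optimizer.zero_grad()
outputs = model(images)                      # images: current mini-batch
l2_reg = compute_l2_reg(model, model_name)   # weight penalty from above
loss = F.mse_loss(outputs, targets) + l2_reg
loss.backward()
optimizer.step()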

Hm, can you check if outputs and targets have the same shape? E.g.,

assert len(outputs.size()) == 1
assert targets.size() == outputs.size()

because of things like

In [2]: import torch

In [3]: a = torch.tensor([1., 2., 3.])

In [4]: torch.nn.functional.mse_loss(a, a)
Out[4]: tensor(0.)

In [5]: torch.nn.functional.mse_loss(a, a.view(-1, 1))
Out[5]: tensor(1.3333)
Also, I think you need to average the L2 component, because by default mse_loss averages the loss. I.e.,

    l2_reg = sum(l2_reg)  ==>  l2_reg = sum(l2_reg) / len(l2_reg)

Thank you very much, I think that this could be my mistake!

In my code:

outputs.size() returns torch.Size([32,1])

and

targets.size() returns torch.Size([32])

Should I reshape outputs to torch.Size([32]) or targets to torch.Size([32, 1])? Or is it the same?

Thank you very much, I hope this is going to solve my problem. And thanks also for noticing the problem with l2_reg

I think it’s the same, but I would do outputs.view(-1).
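I.e., something like:

outputs = model(images).view(-1)   # [32, 1] -> [32], matches targets
loss = F.mse_loss(outputs, targets) + l2_reg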

Thanks again, I will let you know if this solved my problem.

I would be curious to know whether this fixes it. As the last layer usually returns a [num_batch, 1]-dimensional output, the mismatch in dimensions is unfortunately a common trap when using mse_loss. There was a brief discussion with @smth a while ago on Twitter about not allowing dimension mismatches in mse_loss to avoid this.

Yes, it does. Thanks again for your help. I would never have thought that this was the problem; it's not easy to spot. If there isn't any downside, I would recommend not allowing this behaviour with MSE loss, or at least printing a warning message about the possible dimension mismatch.

Regarding your suggestion about the L2 component, shouldn't I divide it by the batch size instead of the number of L2 components?

EDIT: in order to copy the exact behaviour of the Keras kernel_regularizer, I think the L2 regularization component should not be divided by anything at all.
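If I read the Keras docs correctly, regularizers.l2(lambda) adds lambda * sum(w ** 2) for each regularized kernel directly to the total loss, with no averaging. So the equivalent term would be something like this (the parameter filter here is simplified just for illustration):

l2_reg = sum(
    lambda_ * (param ** 2).sum()   # squared sum, unlike torch.norm(..., 2) above
    for name, param in model.named_parameters()
    if name.startswith('residual') and name.endswith('.weight')
)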

This issue is being tracked here, as we've seen it a few times already on this discussion board.
I'm a fan of just raising a hard error on a shape mismatch, but a warning would at least also be sufficient.


I think a hard error would also work :). For those who really want to compute MSE with broadcasting, it is simple enough to write a one-liner like torch.mean((a-b)**2).
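For example, with the tensors from above:

import torch

a = torch.tensor([1., 2., 3.])
b = a.view(-1, 1)

# Explicit opt-in to broadcasting: (a - b) is a [3, 3] matrix here
torch.mean((a - b) ** 2)   # tensor(1.3333), same value as mse_loss above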


A warning is issued in such situations on the PyTorch master branch. It'll be part of the next release.


Awesome, that's great to hear! On a side note, this was among the major friction points for my students as well, and the first thing I looked for when helping them debug.
