Model converges in Keras and not in PyTorch -> dying ReLU

I am trying to implement a CNN for regression on images in PyTorch. I have a working model already implemented in Keras and I would like to translate it to PyTorch, but I am facing many issues.

Essentially, in Keras the model converges, whereas in PyTorch it doesn’t.

In PyTorch I always obtain a constant training loss, and the trained model outputs the same value for every image. This happens regardless of learning rate and batch size, and apparently also for slightly different initializations: it always converges to the same loss, 0.0745.

I initialized all the layers as in the working Keras model, added L2 regularization as in Keras, and implemented the same learning rate decay. Everything looks exactly the same, but in PyTorch my model doesn't converge.

After some debugging I found out that the last ReLU of my trained model is dead. This happens only to the last one; all of the ones before it seem to be working. How would you suggest solving it? I was thinking of trying a leaky ReLU, initializing the biases of the conv layers to 0.1, or directly removing the last ReLU.
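E.g., something along these lines is what I had in mind (the names below are just placeholders, not my actual model):

import torch.nn as nn

# Option 1: replace the last ReLU with a leaky ReLU so negative inputs
# still receive a small gradient instead of dying
last_act = nn.LeakyReLU(negative_slope=0.01)

# Option 2: start the conv biases slightly positive so the ReLUs begin active
def init_conv_biases(model, value=0.1):
    for m in model.modules():
        if isinstance(m, nn.Conv2d) and m.bias is not None:
            nn.init.constant_(m.bias, value)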

Does it include using the same weight initialization scheme?

Where the Keras model specified He normal kernel initialization, I initialized the corresponding kernels with Kaiming normal in PyTorch. Otherwise I used Xavier initialization in PyTorch to emulate the default Keras initialization (Glorot uniform).

I also initialized all the biases to zero because, if I'm not mistaken, that is the Keras default.
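Concretely, I apply the initialization roughly like this (the helper name is mine; it assumes plain Conv2d/Linear layers):

import torch.nn as nn

def init_weights_like_keras(model, he_normal=False):
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            if he_normal:
                # corresponds to Keras he_normal
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            else:
                # corresponds to the Keras default glorot_uniform
                nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)  # Keras default: zeros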

Hm, what loss function do you use? Can you share the code snippet/line that implements and computes the loss?

def compute_l2_reg(model, model_name):
    # Apply weight decay only to the weights (not the biases), and only
    # to the conv layers inside the residual blocks.
    lambda_ = FLAGS.weight_decay
    params_dict = dict(model.named_parameters())
    l2_reg = []
    if model_name == 'resnet8':
        for key, value in params_dict.items():
            if (key[-8:] == '2.weight' or key[-8:] == '5.weight') and key[0:8] == 'residual':
                l2_reg += [lambda_ * torch.norm(value.view(value.size(0), -1), 2)]
    else:
        for key, value in params_dict.items():
            if (key[-8:] == '2.weight' or key[-8:] == '6.weight') and key[0:8] == 'residual':
                l2_reg += [lambda_ * torch.norm(value.view(value.size(0), -1), 2)]
    l2_reg = sum(l2_reg)
    return l2_reg

This is how I compute L2 regularization to emulate the Keras kernel regularization. Then I add l2_reg to the MSE loss (it's a regression problem):

loss = F.mse_loss(outputs, targets) + l2_reg
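
For context, the training step looks roughly like this (variable names simplified):

optimizer.zero_grad()
outputs = model(images)                      # images: current mini-batch
l2_reg = compute_l2_reg(model, model_name)   # weight penalty from above
loss = F.mse_loss(outputs, targets) + l2_reg
loss.backward()
optimizer.step()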

Hm, can you check if outputs and targets have the same shape? E.g.,

assert len(outputs.size()) == 1
assert targets.size() == outputs.size()

because of things like

In [2]: import torch

In [3]: a = torch.tensor([1., 2., 3.])

In [4]: torch.nn.functional.mse_loss(a, a)
Out[4]: tensor(0.)

In [5]: torch.nn.functional.mse_loss(a, a.view(-1, 1))
Out[5]: tensor(1.3333)
Also, I think you need to average the L2 component, because by default mse_loss averages the loss. I.e.,

    l2_reg = sum(l2_reg)  ==>  l2_reg = sum(l2_reg) / len(l2_reg)

Thank you very much, I think that this could be my mistake!

In my code:

outputs.size() returns torch.Size([32,1])

and

targets.size() returns torch.Size([32])

Should I reshape outputs to torch.Size([32]) or targets to torch.Size([32, 1])? Or is it the same?

Thank you very much, I hope this is going to solve my problem. And thanks also for noticing the problem with l2_reg

I think it’s the same, but I would do outputs.view(-1).
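I.e., something like:

outputs = model(images).view(-1)   # [32, 1] -> [32], matches targets
loss = F.mse_loss(outputs, targets) + l2_reg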

Thanks again, I will let you know if this solved my problem.

I would be curious to know whether this fixes it. As the last layer usually returns a [num_batch, 1]-dimensional output, the mismatch in dimensions is unfortunately a common trap when using mse_loss. There was a brief discussion with @smth a while ago on Twitter about not allowing dimension mismatches in mse_loss to avoid this.

Yes, it does. Thanks again for your help. I would never have thought that this was the problem; it's not easy to spot. If there isn't any downside, I would recommend not allowing this behaviour with MSE loss, or at least printing a warning message about the possible dimension mismatch.

Regarding your suggestion about the L2 component, shouldn't I divide it by the batch size instead of the number of L2 components?

EDIT: in order to copy the exact behaviour of the Keras kernel_regularizer, I think the L2 regularization component should not be divided by anything at all.
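If I read the Keras docs correctly, regularizers.l2(lambda) adds lambda * sum(w ** 2) for each regularized kernel directly to the total loss, with no averaging. So the equivalent term would be something like this (the parameter filter here is simplified just for illustration):

l2_reg = sum(
    lambda_ * (param ** 2).sum()   # squared sum, unlike torch.norm(..., 2) above
    for name, param in model.named_parameters()
    if name.startswith('residual') and name.endswith('.weight')
)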

This issue is being tracked here, as we've seen it a few times already on this discussion board.
I'm a fan of just raising a hard error on a shape mismatch, but a warning would at least also be sufficient.


I think a hard error would also work :). For those who really want to compute MSE with broadcasting, it is simple enough to write a one-liner like torch.mean((a-b)**2).
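For example, with the tensors from above:

import torch

a = torch.tensor([1., 2., 3.])
b = a.view(-1, 1)

# Explicit opt-in to broadcasting: (a - b) is a [3, 3] matrix here
torch.mean((a - b) ** 2)   # tensor(1.3333), same value as mse_loss above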


A warning is issued in such situations on the PyTorch master branch. It'll be part of the next release.


Awesome, that's great to hear! On a side note, this was among the major friction points for my students as well, and the first thing I looked for when helping them debug.
