Model not converging after moving from Keras to PyTorch, faulty ReLU?

I recently moved my regression model from Keras to PyTorch, and I have been getting much worse results in PyTorch, to say the least. At first the model wasn't even converging; it just got worse with each epoch (the std of the error kept growing on both the training set and the validation set). This was happening even though I had taken the utmost care to keep the weight/bias initialization, optimizer and loss function parameters, learning rate, and batch size exactly the same as I had them in Keras.

But then, as I was trying out different architectures to see what the problem was, I accidentally forgot to add the ReLU activation to one of them, and my mistake somehow made the model converge! After realizing this I tried removing the ReLU activations from the other architectures as well, and they all started converging too.

Has anyone else experienced the same problem? Is there something wrong with ReLU in PyTorch?

Also, as a side note, I haven't been able to find the source code for torch.relu(); if anyone knows where to find it and could share a link, that would be very helpful. Thank you!

There’s nothing wrong with the ReLU implementation. It’s widely used and pretty simple:

(ReLU calls threshold)

CPU: https://github.com/pytorch/pytorch/blob/3b1c3996e1c82ca8f43af9efa196b33e36efee37/aten/src/ATen/native/cpu/Activation.cpp#L33
CUDA: https://github.com/pytorch/pytorch/blob/3b1c3996e1c82ca8f43af9efa196b33e36efee37/aten/src/ATen/native/cuda/Activation.cu#L290
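To see the equivalence concretely, here is a small check (my addition, not part of the linked ATen code) showing that torch.relu behaves like a threshold at 0, which you can verify from Python:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# relu(x) keeps values > 0 and replaces everything else with 0,
# i.e. threshold(x, threshold=0, value=0)
print(torch.relu(x))             # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(F.threshold(x, 0.0, 0.0))  # same result
print(x.clamp(min=0))            # also equivalent
```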

The choice of derivative for torch.relu at 0 may vary between frameworks. The subdifferential at 0 is the interval [0, 1], so any value in it is a valid subgradient. PyTorch uses 0 for the derivative at 0. I think TensorFlow also uses 0, but other frameworks might use 1. (Values strictly between 0 and 1 are also subgradients, but would make for an awkward choice.)
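You can confirm PyTorch's choice directly with autograd; this quick check (my own, not from the thread) shows the gradient at exactly 0 comes out as 0:

```python
import torch

# Evaluate d/dx relu(x) at x = 0 via autograd
x = torch.tensor([0.0], requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor([0.]) -- PyTorch picks 0 as the subgradient at 0
```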

(ReLU plot, image from https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7)

Also, removing nonlinearities (like ReLU) typically makes models easier to optimize, but less powerful. If you only have linear (or, more precisely, affine) functions like nn.Linear, nn.Conv2d, and +, then you can only learn a linear (affine) function of your inputs, no matter how many layers you stack, as the sketch below shows.
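Here is a small illustration (my own, assuming two plain nn.Linear layers) of why stacking affine layers without an activation in between collapses to a single affine map:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between
f = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))

# They compose into one affine map:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W = f[1].weight @ f[0].weight
b = f[1].weight @ f[0].bias + f[1].bias

x = torch.randn(5, 4)
print(torch.allclose(f(x), x @ W.T + b, atol=1e-6))  # True
```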

My guess is that there's some difference in the interpretation of optimizer parameters (like momentum), plus maybe some other small differences. I remember that PyTorch and Caffe2 interpret momentum differently, but I don't remember the exact difference.

For SGD, I think dampening=True more closely matches the Caffe2 behavior:
https://pytorch.org/docs/stable/optim.html#torch.optim.SGD
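For reference, here is a rough single-parameter sketch (my own simplification, ignoring dampening, weight decay, and Nesterov) of the two common momentum conventions; they behave the same under a constant learning rate but diverge once the learning rate changes mid-training:

```python
lr, mu = 0.1, 0.9  # toy hyperparameters

def pytorch_style_step(p, v, grad):
    # PyTorch-style: velocity accumulates raw gradients,
    # and lr scales the whole velocity at update time.
    v = mu * v + grad
    return p - lr * v, v

def keras_style_step(p, v, grad):
    # Keras/TF-style: lr is folded into the velocity as it accumulates,
    # so past updates keep the learning rate they were made with.
    v = mu * v - lr * grad
    return p + v, v
```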

@colesbury Thank you for such a quick response. Although the model still yields problematic results in PyTorch with ReLU, I was at least able to get it to converge using SELU instead. I guess hoping the model would behave the same in different frameworks was an unrealistic expectation in the first place.