[resolved] Risk of bug in PyTorch! Weird performance of PyTorch vs. Theano


I’ve been recently trying to implement UNet using PyTorch. You can find the implementation here

I’m encountering a really weird problem here!
The network apparently converges (the loss value decreases), but when I run inference on some query images (taken from the training set), the output is very poor, most often a blank image. I’m using Sigmoid/BCELoss() and the Adam optimiser. The data and the rest of the pipeline have been checked and are as expected.
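For reference, a minimal sketch of the Sigmoid + BCELoss training step described above (the model here is a toy stand-in, not the actual UNet; layer sizes and the learning rate are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for the segmentation network (shapes are illustrative).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),  # BCELoss expects probabilities in [0, 1]
)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.rand(2, 3, 32, 32)   # dummy input batch
targets = torch.rand(2, 1, 32, 32)  # dummy binary masks

# One training step: forward, loss, backward, update.
optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```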

On the other hand, I have a successful implementation of UNet in Theano, and I tried my best to keep the training procedure of the two networks identical (e.g. the same fixed training set, the same hyper-parameters, and so on), but the performance with PyTorch is very poor!

Others have also reported the same issue with their implementations, like this

I do not see any reason for the poor performance other than a bug in the PyTorch backend!!
Do you guys have any comment on the implementation?


I wrote some relatively simple semantic segmentation code that may be of use to you: https://github.com/Kaixhin/FCN-semantic-segmentation

However, I too haven’t managed to get great results - I do get OK results though, and never blank outputs. Do you have a link to the Theano implementation that matches your PyTorch one? The closer they are, the easier it will be to pick up any discrepancies.

Thanks for your reply.

Yes, here is the code for saliency detection, which is essentially the same as a binary segmentation problem.


I had a quick look for you, and noticed that their network uses Upscale2DLayer, which seems to be a fixed upsampling operation, like nearest-neighbour upsampling. Your network uses nn.ConvTranspose2d, which is a learned upsampling operation. Some people have reported better success with the fixed version. Also, I do not see batch norm in their network, but you have it in yours.
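To illustrate the distinction: a minimal sketch contrasting a fixed (non-learned) upsampling, comparable to Upscale2DLayer, with the learned nn.ConvTranspose2d (channel count and kernel/stride choices here are illustrative):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 64, 16, 16)  # dummy feature map

# Fixed upsampling: no parameters, just repeats values (nearest neighbour).
fixed_up = nn.Upsample(scale_factor=2, mode='nearest')

# Learned upsampling: a transposed convolution with trainable weights.
learned_up = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)

# Both double the spatial resolution from 16x16 to 32x32,
# but only the second has weights updated during training.
```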

Also, in your description, you wrote that you use Adam, but in your code I see SGD + momentum instead.

Thanks for your reply.

Actually, I’ve modified their work to fit my domain, so the PyTorch version is comparable to this one:

You can see that here I’ve switched the network to a UNet that uses ConvTranspose2d. I’ve also added BatchNorm to the UNet model.
In addition, the problem occurs with both SGD and Adam.

I’ve successfully implemented semantic segmentation models in PyTorch, so I don’t think this is a bug in PyTorch.
There are a few things I’d check in your code:

  • It seems that you use inputs in the range 0-255, without any normalization. Is that intentional? I’d at least put them in the 0-1 range, and possibly subtract the mean or divide by the dataset std.
  • Batch normalization with a batch size of 2 doesn’t look right to me; it might introduce a very large bias because the batch size is so small. I’d say that to use batch norm successfully you’d need batch sizes of 128 or 256.
  • It seems that you don’t use a pre-trained network to initialize your model? This could lead to reduced performance.
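A minimal sketch of the normalization suggested in the first point (the mean/std values below are placeholders, not statistics of any real dataset):

```python
import torch

# Dummy batch of images in the raw 0-255 range.
images = torch.randint(0, 256, (4, 3, 64, 64)).float()

# Step 1: scale to [0, 1].
images = images / 255.0

# Step 2: center and rescale (illustrative values; in practice compute
# the mean and std from your own training set).
mean, std = 0.5, 0.25
images = (images - mean) / std
```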

In your implementation, do you use ConvTranspose2d or Upsampling?
Regarding your points:

  • I intentionally changed the range to 0-255, since the 0-1 range resulted in the same problem.
  • The problem persists with or without batch normalization. However, I’m using a batch size of 16 in my recent attempts.
  • I don’t think it’s fair to attribute the problem to the lack of pre-trained initialization, since the equivalent Theano implementation is trained from scratch and works like a charm!!

Thanks for your reply.

I finally found the problem!!
For the last set of convolutions, that is 128 -> 64 -> 64 -> 1, the activation function should not be used!
The activation function causes the output values to vanish!

I just removed the nn.ReLU() modules on top of these convolution layers and now everything works fine!
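A sketch of the fix described above, assuming a final block of 128 -> 64 -> 64 -> 1 channels (the channel sizes come from the post; kernel sizes and padding are illustrative). The key point is that there is no ReLU between the last convolution and the Sigmoid, since a ReLU there clamps all negative pre-activations to 0 and so forces every Sigmoid output to be at least 0.5:

```python
import torch
import torch.nn as nn

# Final block with the nn.ReLU() modules removed, as described in the post.
final_block = nn.Sequential(
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 1, kernel_size=3, padding=1),
    # No ReLU before the Sigmoid: the output can now cover the
    # full (0, 1) range instead of being stuck in [0.5, 1).
    nn.Sigmoid(),
)

x = torch.rand(1, 128, 32, 32)  # dummy feature map from the decoder
out = final_block(x)
```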


It’s great that you managed to solve your problem. Would you mind summarising a) what worked for you and b) what didn’t work for you/errors so that other people can benefit?