I am training an unsupervised CNN encoder-decoder network for dense output (optical flow). The network trains fine in PyTorch 1.4 and gives proper output. However, in PyTorch 1.6, with the same configuration, the trained network gives constant output (a single value across the whole image).
It is very hard to say without more information. There are almost ten thousand commits between 1.4 and 1.6.
Can you explain what you changed when doing the upgrade?
What have you tried to identify the difference?
How stable was the training in 1.4?
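One way to narrow this down is to check for the collapse directly during training, so both versions can be compared at the same step. A minimal sketch, assuming the prediction is a tensor of shape (N, C, H, W); the helper names `spatial_std` and `is_collapsed` and the tolerance are hypothetical, not part of PyTorch:

```python
import torch


def spatial_std(flow):
    """Standard deviation over the spatial dimensions, per sample and channel.

    Assumes `flow` has shape (N, C, H, W).
    """
    return flow.flatten(start_dim=2).std(dim=2)


def is_collapsed(flow, tol=1e-6):
    """True if every channel of every sample is (almost) spatially constant."""
    return bool((spatial_std(flow) < tol).all())


# Toy inputs standing in for network predictions (illustration only).
torch.manual_seed(0)
constant = torch.full((1, 2, 8, 8), 0.5)  # one value in the whole image
varied = torch.randn(1, 2, 8, 8)          # spatially varying output

print(is_collapsed(constant))
print(is_collapsed(varied))
```

Logging `spatial_std` every few hundred iterations in both 1.4 and 1.6 would show when the output starts to flatten, which is more informative than inspecting only the final model.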
Well, first I thought the problem was in distributed training, so I disabled that and ran on a single GPU. Then I thought it was the network architecture. I tried other architectures but saw the same issue.
Finally, I concluded that the cause was none of the above but the L2 regularization I was using with the Adam optimizer. I was using a weight decay of 0.0005, which I think was too large.
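For anyone hitting the same issue, here is a minimal sketch of how the weight decay could be reduced and limited to the weight tensors. In `torch.optim.Adam`, `weight_decay` is added to the gradient as an L2 penalty, so it applies to every parameter in a group, biases included. The helper `build_optimizer`, the toy model, and the values `lr=1e-4` and `weight_decay=1e-5` are assumptions for illustration, not the settings from my training run:

```python
import torch
import torch.nn as nn


def build_optimizer(model, lr=1e-4, weight_decay=1e-5):
    """Adam with weight decay on weight tensors only (hypothetical helper).

    Parameters with one dimension or fewer (biases, normalization scales)
    go into a group with no decay.
    """
    decay, no_decay = [], []
    for _, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim <= 1:
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.Adam(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
    )


# Toy stand-in for the encoder-decoder; two output channels for flow (u, v).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 2, kernel_size=3, padding=1),
)
optimizer = build_optimizer(model)
```

Setting `weight_decay=0.0` first is a quick way to confirm that the regularization is what drives the output to a constant; `torch.optim.AdamW`, which decouples the decay from the gradient update, is another option to try.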