Why does the same code behave differently?

Dzhange · July 11, 2020, 4:14pm

Hi
My model used to perform well with batch size 2, but at some point the loss started to look like this with batch size 2 (the code and data are the same!).

However, if I change the batch size to 16, the loss curve looks fine, but the loss drops much more slowly than in the earlier training runs.

The model I used is xnocs/models/SegNet.py at master · drsrinathsridhar/xnocs · GitHub
And the loss function is tk3dv/tk3dv/ptTools/loaders/GenericImageDataset.py at 2028b5ea77aca3410d9bf794daf14566ac6bf589 · drsrinathsridhar/tk3dv · GitHub

In another scenario, where I used the model on a harder object, training with batch size 2 now always converges to an all-white image (it used to perform well). When I change the batch size to 16, the result is correct but the loss stays high.

I’ve been struggling for days to figure out the reason, without success. At first I thought this was caused by an environment problem, but the bug shows up in EVERY environment. Can anyone offer any suggestions?

Many thanks!

Shisho_Sama · July 12, 2020, 7:44am

It’s very possible that you made a mistake back then. It happens. You think you set it to something like 2, but in fact it might really have been 16; you trained with 16 and then some time later changed it back to 2 (or maybe changed something else: a higher dropout ratio, weight decay, a larger or smaller feature map, etc.) and forgot about it. To be sure you really ran with a batch size of 2, check your logs if you have any.
Another possibility is the randomness that comes from the seed. Try using the same seed for all your operations and see how that goes.
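
For reference, here is a minimal sketch of what "using the same seed" could look like in a PyTorch script. The helper name `seed_everything` is made up for illustration and is not part of xnocs or tk3dv; the cuDNN flags are an extra assumption on my side, since a fixed seed alone does not force deterministic GPU kernels.

```python
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Seed the RNGs a typical PyTorch training script draws from (hypothetical helper)."""
    random.seed(seed)                 # Python's built-in RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA
    # cuDNN can choose different kernels from run to run unless autotuning is
    # turned off and deterministic kernels are requested.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


# Re-seeding should reproduce the same random draws.
seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))
```

If the data pipeline also uses NumPy or worker processes, those would need seeding too; that part is not covered by this sketch.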

Dzhange · July 12, 2020, 8:17am

Thanks for the reply!
I’m sure that I’ve always been using batch size 2, and the hyperparameters are fixed as well.
Also, the random seed is set to 0 every time.
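
One way to check whether a fixed seed really makes a run repeatable is to train for a few steps twice and compare the losses. The sketch below uses a tiny stand-in network and synthetic data (both assumptions for illustration, not the SegNet model or dataset from the repositories above), with seed 0 and batch size 2 as in this thread.

```python
import torch
from torch import nn


def short_run(seed: int, steps: int = 5, batch_size: int = 2) -> list:
    """Train a tiny stand-in model for a few steps and return the per-step losses."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    losses = []
    for _ in range(steps):
        x = torch.randn(batch_size, 8)  # synthetic inputs
        y = torch.randn(batch_size, 1)  # synthetic targets
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


run_a = short_run(seed=0)
run_b = short_run(seed=0)
print("identical loss curves:", run_a == run_b)
```

If the same comparison on the real model shows different curves on one machine, the seed is probably not the only source of randomness there.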

Dzhange · July 12, 2020, 4:30pm

In my initial post I said the code doesn’t work in any environment, but I just noticed that the results of this model vary between environments. On one AWS server (Deep Learning AMI, version 20, Ubuntu 16.04), the model did give non-all-white output with a batch size of 2. But the same setting just won’t work on another machine (the only difference I can tell is the Nvidia driver: one is 410.79 and the other is 410.129).
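
To compare two machines beyond the driver version, it may help to print the library versions on each and diff the output. This is only a sketch; the function name `environment_report` is made up for illustration.

```python
import platform

import torch


def environment_report() -> dict:
    """Collect version info that commonly differs between machines (hypothetical helper)."""
    cuda_ok = torch.cuda.is_available()
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,       # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "gpu": torch.cuda.get_device_name(0) if cuda_ok else None,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

The driver version itself is not included in this report; `nvidia-smi` prints it on each machine.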

Could this be a bug in PyTorch?