Different machines, different results

Riccardo_De_Feo · October 21, 2020, 9:52am

Hello,
I am experiencing a strange issue with replicating the same results on two different platforms.
I trained a neural network for medical imaging segmentation on a supercomputer using an Nvidia Volta V100 GPU, torch version 1.4.0, CUDA 10.1, evaluating its results using Dice scores.
When a reviewer asked for additional measures of the network’s accuracy, I downloaded the code to my machine and regenerated its outputs, obtaining an entirely different, higher Dice score (average of ~0.95 vs ~0.82).

I can easily replicate this result on both platforms: running exactly the same code (the full directory copied over) and loading the same weights, I still obtain two different values. In my local environment I am also using torch 1.4.0 and CUDA 10.1, though of course with a different GPU (GeForce RTX 2080 Ti). I also made sure to test this with the same version of numpy in both environments (1.18.1).

Edit: I am also running the same Python version, 3.7.6, except that on my local machine it runs inside a conda environment. The only other difference could be the GCC version: 7.3.0 on the V100 machine, 7.5.0 on mine.

Do you have any idea where this difference could be coming from?

albanD · October 21, 2020, 2:27pm

Hi,

This is expected behavior, I’m afraid.
Different machines have different hardware/software stacks, which can lead to very small differences in the results of floating point ops. When training a neural net, such errors are usually amplified by the training process, leading to different final results.
But if your training is stable, it should converge to very similar loss function values.

Riccardo_De_Feo · October 21, 2020, 5:04pm

My problem here is that there is no training involved. I am quite literally just loading the same weights and running the inference on the test set.

albanD · October 21, 2020, 5:50pm

How large is the difference then? A small(ish) difference is expected even for inference only.

Riccardo_De_Feo · October 21, 2020, 6:40pm

By replicating the evaluation on the same system (V100) I do get a very small difference: average Dice overlap of 0.8229 vs 0.8243. That would be fine. On my local system I get 0.9527. Now, I like having a better result, but this difference looks pretty big.

albanD · October 21, 2020, 6:55pm

There might be other issues as well if you don’t have the exact same data on both machines (for example, if you preprocessed one dataset but not the other).
From the PyTorch side, you should expect to see small differences but nothing very big. You can check that by making sure the forward passes give almost the same result for the same input.
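
One way to rule out a data mismatch is to checksum the input files on each machine and compare the results. A minimal sketch using only the standard library; `dataset_manifest`, `manifest_diff` and the directory layout are hypothetical names for illustration, not taken from the code discussed here:

```python
import hashlib
from pathlib import Path


def dataset_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    # sorted() keeps the traversal order identical on every machine
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; hash in chunks for very large volumes
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def manifest_diff(a, b):
    """Return the relative paths that are missing on one side or differ in content."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

You could dump the manifest to JSON on each machine and run `manifest_diff` on the two dicts. An empty list means the files are byte-identical, so any remaining gap has to come from somewhere else.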

Riccardo_De_Feo · October 21, 2020, 9:59pm

I eventually found out that, while the code and files were exactly the same, the ordering of the files was different on the remote server, which caused the discrepancy. As usual, the actual issue was a simple mistake. Thank you for your time!
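
For anyone who hits the same thing: `os.listdir` returns entries in arbitrary order, so two independently listed folders can pair up differently from one machine to the next. A minimal sketch of one way to avoid this, assuming predictions and labels share file names; `paired_files` and the two-folder layout are hypothetical, not the code from this thread:

```python
import os


def paired_files(pred_dir, label_dir):
    """Pair prediction and label files by name rather than by listing position."""
    preds = set(os.listdir(pred_dir))
    labels = set(os.listdir(label_dir))
    if preds != labels:
        # Fail loudly instead of silently scoring mismatched volumes
        raise ValueError(f"unmatched files: {sorted(preds ^ labels)}")
    # sorted() makes the iteration order the same on every machine
    return [
        (os.path.join(pred_dir, name), os.path.join(label_dir, name))
        for name in sorted(preds)
    ]
```

Matching by name means a missing or extra file raises an error instead of shifting every later pair by one, and sorting keeps per-file scores in the same order on both systems.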