Inconsistent results with 3d MaxPool on GPU

I am experiencing some inconsistent predictions when running the same file multiple times. I followed the guide to set all the seeds and enable/disable the respective cudnn flags.
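For readers following along, the seed and cuDNN setup mentioned above usually looks roughly like the sketch below. The helper name `set_deterministic` is hypothetical, and the exact settings of the original script are not shown in this thread, so treat this as an assumption rather than the code that was actually run.

```python
import random

import torch


def set_deterministic(seed=0):
    """Hypothetical helper: seed the RNGs and set the cuDNN flags."""
    random.seed(seed)                  # Python's own RNG
    torch.manual_seed(seed)            # PyTorch RNG
    torch.cuda.manual_seed_all(seed)   # all GPUs; silently ignored without CUDA
    torch.backends.cudnn.deterministic = True   # request deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # disable the cuDNN autotuner
    # If NumPy is used for data loading, np.random.seed(seed) belongs here too.


set_deterministic(0)
a = torch.rand(3)
set_deterministic(0)
b = torch.rand(3)
print(torch.equal(a, b))  # same seed, same random numbers
```

Note that these settings only cover random number generation and cuDNN algorithm selection; they do not change how native CUDA kernels accumulate values.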
Training a small 3D network with some pooling layers on the CPU runs smoothly, and the results are consistent across multiple runs. When training on the GPU, the network predictions change after a few iterations, even though the exact same network (and initialization) is used. Removing the pooling layers fixes the problem. Furthermore, I do not observe such inconsistencies when using an equivalent 2D model (with pooling layers).
Below is a generic 3D example that replicates my observations on my system.

Note: I’m not referring to differences between a CPU run and a GPU run. I’m referring to differences between multiple runs on the same device (CPU or GPU).

Tested on:
Ubuntu 18.04
Tried with: Torch 1.0post2 with CUDA 10
Torch 0.4.1 with CUDA 9
Nvidia GTX 980

Edit: updated the description and replaced the 3D ResNet example with a much simpler/smaller network.

Thanks for the code snippet!
I debugged it a bit and think you are seeing some non-determinism due to some atomic operations.
From the reproducibility docs:

There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd , where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result. […]
A number of operations have backwards that use atomicAdd , in particular torch.nn.functional.embedding_bag() , torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.
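To illustrate why the order of additions matters, here is a small CPU-only sketch in plain Python. It is not the CUDA kernel itself, just a demonstration that floating-point addition is not associative, which is what makes an undetermined atomicAdd order visible in the result.

```python
# Floating-point addition is not associative: grouping the same three
# numbers differently changes the rounding of the intermediate sum.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)       # False
print(abs(left - right))   # a difference of roughly 1e-16

# On the GPU, many threads add their contribution to the same memory
# location in whatever order they happen to be scheduled, so the grouping
# (and therefore the rounding) can differ from run to run.
```

A single difference like this is tiny, but during training it feeds into the next weight update, so it can grow over iterations.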

I’ve compared two runs for a single epoch. Since the differences grow over the iterations, I assume this is due to the atomicAdd in the backward function of your pooling layers.
Here are the differences for the predictions and losses:

print((preds1 - preds2).abs())
> tensor([0.0000e+00, 0.0000e+00, 1.1921e-07, 2.6822e-07, 2.9802e-08, 1.6391e-07,
        1.3411e-07, 1.4901e-07, 2.9802e-07, 4.7684e-07, 8.3447e-07, 2.3842e-07,
        1.3113e-06, 5.9605e-07, 2.3842e-07, 4.7684e-07, 4.1723e-06, 6.8545e-06,
        9.6858e-06, 1.2815e-05, 4.9993e-06, 6.1840e-06, 5.8860e-06, 6.7577e-06,
        6.6683e-06, 5.5283e-06, 5.8264e-06, 6.8545e-06, 7.7151e-06, 8.6427e-06,
        7.5623e-06, 6.1095e-06, 5.7220e-06, 1.9968e-06, 2.0415e-06, 9.5367e-07,
        1.5795e-06, 2.2650e-06, 1.2517e-06, 4.8578e-06, 5.9009e-06, 1.0103e-05,
        1.5825e-05, 2.7329e-05, 2.5600e-05, 3.5465e-05, 2.5690e-05, 1.8418e-05,
        5.1737e-05, 9.8228e-05])
print((losses1 - losses2).abs())
> tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 5.9605e-08, 0.0000e+00, 5.9605e-08,
        5.9605e-08, 5.9605e-08, 1.4901e-07, 8.9407e-08, 1.4901e-07, 1.1921e-07,
        1.0729e-06, 1.0431e-07, 2.3842e-07, 2.3842e-07, 2.9802e-06, 4.5300e-06,
        5.9009e-06, 5.7220e-06, 2.6226e-06, 3.0398e-06, 3.0398e-06, 3.3379e-06,
        3.2783e-06, 2.9802e-06, 3.0994e-06, 3.6359e-06, 3.9935e-06, 4.3511e-06,
        3.6359e-06, 3.2187e-06, 2.6822e-06, 1.1325e-06, 8.3447e-07, 3.5763e-07,
        8.9407e-07, 8.9407e-07, 5.9605e-07, 2.8014e-06, 3.6955e-06, 3.9935e-06,
        9.5367e-06, 1.1146e-05, 1.0610e-05, 1.4484e-05, 1.0252e-05, 7.1228e-06,
        3.2902e-05, 3.5226e-05])
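A comparison like the one above can be set up roughly as follows. This is a minimal sketch and not the script from this thread: the tiny network, the random data, and the helper name `run_once` are all made up for illustration.

```python
import torch
import torch.nn as nn


def run_once(seed, device, steps=5):
    """Hypothetical helper: train a tiny 3D network and return the losses."""
    torch.manual_seed(seed)
    model = nn.Sequential(
        nn.Conv3d(1, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        # Overlapping windows (kernel_size != stride), as in the problematic case.
        nn.MaxPool3d(kernel_size=3, stride=2, padding=1),
        nn.Flatten(),
        nn.Linear(4 * 4 * 4 * 4, 1),
    ).to(device)
    # Create the data on the CPU first so both runs see identical inputs.
    x = torch.randn(2, 1, 8, 8, 8).to(device)
    target = torch.randn(2, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return torch.tensor(losses)


device = "cuda" if torch.cuda.is_available() else "cpu"
losses1 = run_once(0, device)
losses2 = run_once(0, device)
diff = (losses1 - losses2).abs()
print(diff)
```

Whether the differences are non-zero depends on the device and the PyTorch version, so no expected output is given here.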

Thank you for your quick response!
I guess you are right, I just wanted to double check. I was a bit surprised that this only happens with 3D pooling and not with 2D pooling (or an equivalent 2D version of the network).

I’m not sure why this happens in the 3D case but not in 2D, but atomicAdd will be used if the kernel_size is not equal to the stride, as described in these lines of code.
If you change your nn.MaxPool3d layers to nn.MaxPool3d(kernel_size=2, stride=2, padding=0), you should get deterministic results.
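As a small sketch of the suggested setting (the input shape here is made up for illustration): when kernel_size equals stride, the pooling windows do not overlap, so each input element belongs to at most one window and receives a gradient from at most one output element. My understanding is that this is why no accumulation into a shared location is needed in the backward pass.

```python
import torch
import torch.nn as nn

# Non-overlapping windows: kernel_size == stride, no padding.
pool = nn.MaxPool3d(kernel_size=2, stride=2, padding=0)

x = torch.randn(1, 4, 8, 8, 8, requires_grad=True)
y = pool(x)
y.sum().backward()

print(y.shape)              # each spatial dimension is halved: 8 -> 4
print(x.grad.max().item())  # 1.0: no input element collects more than one gradient
```

This runs on the CPU as written; the same layer definition can be dropped into the model on the GPU.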

Thank you, good to know :)