Thanks for the code snippet!
I debugged it a bit and think you are seeing non-determinism caused by atomic operations.
From the reproducibility docs:
There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd, where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result. […] A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.
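
To make the order dependence concrete, here is a small CPU-only sketch (not taken from your code) showing that merely summing the same floating-point values in a different order already shifts the result slightly; atomicAdd introduces exactly this kind of undefined ordering on the GPU:

import torch

torch.manual_seed(0)
x = torch.randn(100000)

# Same values, two different accumulation orders
s1 = x.sum()
s2 = x[torch.randperm(x.numel())].sum()

print((s1 - s2).abs())  # usually a small, non-zero difference in float32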
I’ve compared two runs for a single epoch, and since the differences are increasing, I assume this might be due to the atomicAdd in the backward function of your pooling layers.
Here are the differences for the predictions and losses:
print((preds1 - preds2).abs())
> tensor([0.0000e+00, 0.0000e+00, 1.1921e-07, 2.6822e-07, 2.9802e-08, 1.6391e-07,
1.3411e-07, 1.4901e-07, 2.9802e-07, 4.7684e-07, 8.3447e-07, 2.3842e-07,
1.3113e-06, 5.9605e-07, 2.3842e-07, 4.7684e-07, 4.1723e-06, 6.8545e-06,
9.6858e-06, 1.2815e-05, 4.9993e-06, 6.1840e-06, 5.8860e-06, 6.7577e-06,
6.6683e-06, 5.5283e-06, 5.8264e-06, 6.8545e-06, 7.7151e-06, 8.6427e-06,
7.5623e-06, 6.1095e-06, 5.7220e-06, 1.9968e-06, 2.0415e-06, 9.5367e-07,
1.5795e-06, 2.2650e-06, 1.2517e-06, 4.8578e-06, 5.9009e-06, 1.0103e-05,
1.5825e-05, 2.7329e-05, 2.5600e-05, 3.5465e-05, 2.5690e-05, 1.8418e-05,
5.1737e-05, 9.8228e-05])
print((losses1 - losses2).abs())
> tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 5.9605e-08, 0.0000e+00, 5.9605e-08,
5.9605e-08, 5.9605e-08, 1.4901e-07, 8.9407e-08, 1.4901e-07, 1.1921e-07,
1.0729e-06, 1.0431e-07, 2.3842e-07, 2.3842e-07, 2.9802e-06, 4.5300e-06,
5.9009e-06, 5.7220e-06, 2.6226e-06, 3.0398e-06, 3.0398e-06, 3.3379e-06,
3.2783e-06, 2.9802e-06, 3.0994e-06, 3.6359e-06, 3.9935e-06, 4.3511e-06,
3.6359e-06, 3.2187e-06, 2.6822e-06, 1.1325e-06, 8.3447e-07, 3.5763e-07,
8.9407e-07, 8.9407e-07, 5.9605e-07, 2.8014e-06, 3.6955e-06, 3.9935e-06,
9.5367e-06, 1.1146e-05, 1.0610e-05, 1.4484e-05, 1.0252e-05, 7.1228e-06,
3.2902e-05, 3.5226e-05])
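
For reference, this is roughly how such a two-run comparison can be set up (a minimal sketch, not your training code: the toy model, data, and hyperparameters are placeholders, and whether non-zero differences actually show up depends on which CUDA kernels your model hits):

import torch
import torch.nn as nn

def run_epoch(seed=2809, steps=50):
    # Seed everything so both runs start from identical weights and inputs;
    # any remaining differences then come from non-deterministic kernels
    # (e.g. atomicAdd in pooling backwards).
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 1),
    ).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.MSELoss()
    preds, losses = [], []
    for _ in range(steps):
        x = torch.randn(16, 3, 32, 32, device='cuda')
        y = torch.randn(16, 1, device='cuda')
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        preds.append(out.detach().mean().cpu())
        losses.append(loss.detach().cpu())
    return torch.stack(preds), torch.stack(losses)

preds1, losses1 = run_epoch()
preds2, losses2 = run_epoch()
print((preds1 - preds2).abs())
print((losses1 - losses2).abs())

The growth of the differences over the iterations is expected: once a single gradient differs slightly, the optimizer updates diverge and the small errors compound from step to step, which matches the pattern in your outputs above.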