Random behavior of nn.Conv2d on 1080 Ti (with Intel) but NOT P100 (with IBM Power8)

I’ve encountered a problem where the gradients of a conv2d layer behave randomly on GPU. It only happens on GPU, and only with certain hyperparameters.
Specifically, for the same input and the same network, the gradients are not exactly the same on every run.

import torch
import numpy as np
from torch.autograd import Variable
import torch.nn as nn


NumChannels = 32


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(
            NumChannels, NumChannels, kernel_size=3, stride=1)
        self.conv2 = nn.Conv2d(
            NumChannels, NumChannels, kernel_size=3, stride=2)

    def forward(self, x):
        out = x
        out = self.conv1(out)
        out = self.conv2(out)
        return out


if __name__ == '__main__':
    batch_size = 11
    np.random.seed(6)
    torch.manual_seed(6666)
    inputs = np.random.uniform(0, 1, size=(batch_size, NumChannels, 32, 32))
    inputs = torch.from_numpy(inputs.astype(np.float32))

    model = Net()
    model.eval()

    prev_gradsum = 0
    prev_outputsum = 0

    model.cuda()
    inputs = inputs.cuda()
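    # Run the same forward/backward pass repeatedly; with fixed weights and
    # inputs, a deterministic backend should give bitwise-identical results.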
    for ii in range(100):

        xvar = Variable(inputs, requires_grad=True)
        output = model(xvar)
        loss = (output ** 4).sum()
        loss.backward()

        if prev_gradsum != 0:
            assert xvar.grad.data.sum() == prev_gradsum, \
                (xvar.grad.data.sum(), prev_gradsum, ii)
            assert output.data.sum() == prev_outputsum, prev_outputsum
        prev_gradsum = xvar.grad.data.sum()
        prev_outputsum = output.data.sum()

The assert fails with:

Traceback (most recent call last):
  File "test_wrn.py", line 50, in <module>
    (xvar.grad.data.sum(), prev_gradsum, ii)
AssertionError: (963.59619140625, 963.5963134765625, 48)

This does NOT happen on CPU, and does NOT happen on P100 cards in an IBM Minsky Power8 machine, but it DOES happen on 1080 Ti cards in several different Intel machines.

We used 0.4.0a0+b21e135 for both the 1080 Ti and the P100, and
0.3.1.post2 for the 1080 Ti.

Is this something to be expected?

Could you try disabling cuDNN, since it has some non-deterministic behavior:

torch.backends.cudnn.enabled = False
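
For example (a minimal sketch: Option 1 is the suggestion above; Option 2 assumes a build that exposes torch.backends.cudnn.deterministic, which keeps cuDNN enabled but restricts it to deterministic algorithms at some speed cost). Set the flag before building the model:

import torch

# Option 1: bypass cuDNN entirely and fall back to PyTorch's native kernels
torch.backends.cudnn.enabled = False

# Option 2 (alternative): keep cuDNN but request deterministic algorithms only
# torch.backends.cudnn.deterministic = True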

The backward pass of the convolution in cuDNN is not guaranteed to be deterministic; this is expected.

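The observed discrepancy (963.59619140625 vs. 963.5963134765625) is at the level of float32 rounding, which is consistent with the gradient contributions being accumulated in a different order on each run. If bitwise reproducibility is not required, comparing within a tolerance sidesteps the issue. A minimal sketch, reusing model, inputs, and Variable from the script above (the 1e-5 relative tolerance is an arbitrary choice for illustration):

prev_gradsum = None
for ii in range(100):
    xvar = Variable(inputs, requires_grad=True)
    loss = (model(xvar) ** 4).sum()
    loss.backward()
    gradsum = float(xvar.grad.data.sum())
    if prev_gradsum is not None:
        # tolerate small reordering noise from non-deterministic kernels
        assert abs(gradsum - prev_gradsum) <= 1e-5 * abs(prev_gradsum), \
            (gradsum, prev_gradsum, ii)
    prev_gradsum = gradsum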