Getting inconsistent CUDA results for identical function?

I noticed incorrect behavior in our project, so I ran the following experiment, which shows that the CUDA result is surprisingly inconsistent for an identical function and identical inputs.
I would appreciate it if anyone knows a workaround.
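A minimal sketch of the kind of check I mean (the exact code may differ; this just runs a small nn.Linear twice on the same GPU input):

import torch
from torch import nn

# Same layer, same input, run twice on the GPU.
net = nn.Linear(2, 1).cuda()
data = torch.randn(16, 2).cuda()

out1 = net(data)
out2 = net(data)

# With identical weights and inputs this should be exactly zero;
# on the machine showing the issue it comes out non-zero.
print((out1 - out2).abs().max())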

My system:
pytorch 0.4.0
cuda 10.1
python 3.5.2

I know this is not a recommended combination, but our project is not compatible with newer versions of pytorch, so we have to stick to an earlier version.

Hi,

I cannot reproduce this with the latest pytorch:

In [1]: import torch

In [2]: from torch import nn

In [3]: net = nn.Linear(2, 1).cuda()
In [4]: data = torch.randn(16, 2).cuda()

In [5]: ou1 = net(data)

In [6]: ou2 = net(data)

In [7]: ou1 - ou2
Out[7]: 
tensor([[0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.]], device='cuda:0', grad_fn=<SubBackward0>)

This can either be a bug that has been fixed or an install/hardware issue.
Do you see the same thing when running on a different machine?

No, it does not reproduce on our older machines.

Both machines have a similar configuration, except that the 2080 machine has the issue and the 1080 machine does not.

Oh, did you actually compile pytorch 0.4 from source for cuda 10?

I used the following command to install pytorch.

pip3 install https://download.pytorch.org/whl/cu91/torch-0.4.0-cp35-cp35m-linux_x86_64.whl

Since there is no cuda 10 build of torch 0.4.0, I chose the closest one.

@ngimel could using a binary built for cuda 9.1 and running it with cuda 10 be causing an issue?

A 2080 is not expected to work with a cuda 9.1 build of pytorch.
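A quick way to confirm the mismatch (a sketch; these calls should also be available in torch 0.4):

import torch

# CUDA version this PyTorch binary was built against
# (the cu91 wheel above reports a 9.1.x version here).
print(torch.version.cuda)

# GPU name and compute capability; an RTX 2080 reports (7, 5),
# which needs a CUDA 10 build of PyTorch.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))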
