Getting inconsistent CUDA results for identical function?

I noticed incorrect behavior in our project, so I ran the following experiment, which shows that the CUDA result is surprisingly inconsistent for an identical function and identical inputs.
I would appreciate it if anyone knows a workaround.
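A minimal sketch of the kind of check I mean (the exact code may differ; this just runs a small nn.Linear twice on the same GPU input):

import torch
from torch import nn

# Same layer, same input, run twice on the GPU.
net = nn.Linear(2, 1).cuda()
data = torch.randn(16, 2).cuda()

out1 = net(data)
out2 = net(data)

# With identical weights and inputs this should be exactly zero;
# on the machine showing the issue it comes out non-zero.
print((out1 - out2).abs().max())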

My system:
pytorch 0.4.0
cuda 10.1
python 3.5.2

I know this is not a recommended combination, but our project is not compatible with newer versions of pytorch, so we have to stick to an earlier version.

Hi,

I cannot reproduce this with the latest pytorch:

In [1]: import torch

In [2]: from torch import nn

In [3]: net = nn.Linear(2, 1).cuda()
In [4]: data = torch.randn(16, 2).cuda()

In [5]: ou1 = net(data)

In [6]: ou2 = net(data)

In [7]: ou1 - ou2
Out[7]: 
tensor([[0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.]], device='cuda:0', grad_fn=<SubBackward0>)

This can either be a bug that has been fixed or an install/hardware issue.
Do you see the same thing when running on a different machine?

No, it does not reproduce on our older machines.

Both machines have a similar configuration, except that the 2080 machine has the issue and the 1080 machine does not.

Oh, did you actually compile pytorch 0.4 from source for cuda 10?

I used the following command to install pytorch.

pip3 install https://download.pytorch.org/whl/cu91/torch-0.4.0-cp35-cp35m-linux_x86_64.whl

Since there is no cuda 10 build of torch 0.4.0, I chose the closest one.

@ngimel could using a binary built for cuda 9.1 and running it with cuda 10 be causing an issue?

A 2080 is not expected to work with a cuda 9.1 build of pytorch.
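A quick way to confirm the mismatch (a sketch; these calls should also be available in torch 0.4):

import torch

# CUDA version this PyTorch binary was built against
# (the cu91 wheel above reports a 9.1.x version here).
print(torch.version.cuda)

# GPU name and compute capability; an RTX 2080 reports (7, 5),
# which needs a CUDA 10 build of PyTorch.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))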
