Problem transferring tensors between GPUs

Hi,

I need some help understanding this behaviour. I’m trying to move a tensor between GPUs with the .to() method. However, when I move the tensor from GPU 0 to GPU 1, the resulting tensor on GPU 1 contains only zeros, as can be seen here:

>>> import torch
>>> a= torch.rand(3,3)
>>> a
tensor([[0.6060, 0.0625, 0.8044],
        [0.1404, 0.2677, 0.0491],
        [0.4104, 0.7037, 0.1225]])
>>> b=a.to('cuda:0')
>>> b
tensor([[0.6060, 0.0625, 0.8044],
        [0.1404, 0.2677, 0.0491],
        [0.4104, 0.7037, 0.1225]], device='cuda:0')
>>> b.to('cuda:1')
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='cuda:1')

This happens with Titan X GPUs. However, if I use GTX 1080s it works properly. Is there any explanation for this behaviour?

Thanks

Do you see the same behavior if your do:

c = b.to('cuda:1')
torch.cuda.synchronize(1)
print(c)

?

Could you also give more information about your setup, please? What are your CUDA version, NVIDIA driver version, and PyTorch version, and how did you install it? Thanks
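If it helps, PyTorch ships a small helper that gathers all of this in one go:

```python
# PyTorch's built-in environment collector; prints the PyTorch, CUDA,
# cuDNN, NVIDIA driver and OS versions in the format used for bug reports.
from torch.utils import collect_env

collect_env.main()
```

You can also run it from the command line as `python -m torch.utils.collect_env` and paste the output here.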

This might be a missing sync on our side. cc @ngimel

Does print(b.to('cuda:1')) change the behavior?

Hi,

I see the same using your code:

>>> import torch
>>> a= torch.rand(3,3)
>>> b=a.to('cuda:0')
>>> b
tensor([[0.9206, 0.9123, 0.1227],
        [0.5668, 0.4557, 0.0798],
        [0.1586, 0.2552, 0.2610]], device='cuda:0')
>>> c = b.to('cuda:1')
>>> torch.cuda.synchronize(1)
>>> print(c)
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='cuda:1')

Could you also give more information about your setup, please? What are your CUDA version, NVIDIA driver version, and PyTorch version, and how did you install it? Thanks

I’m using PyTorch 1.3, CUDA 10, and NVIDIA driver 430.26. The GPUs I’m using are GeForce GTX TITAN X.

Nope, I’m seeing the same behaviour. I was wondering whether the problem might be my GPUs. Could that be it?

I tried with 1.3, CUDA 10 and driver 410.79, but I can’t reproduce this…

This is not necessarily the GPUs. It could also be some weird interaction between libraries.
Let’s see what @ptrblck thinks.

I’m also struggling to reproduce this issue and haven’t seen it yet.

@Dhorka

  • Are you able to initialize a random tensor on GPU1 directly?
  • If you are using a larger tensor (let’s say 1 million values), is c still all zeros or does it contain random values (including Inf, NaN)?
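Something like this would cover both checks and also verify the copy against the CPU original (a sketch, assuming at least two visible CUDA devices; `check_p2p_copy` is just a name I picked for illustration):

```python
import torch

def check_p2p_copy(n=1000):
    """Return True if a CPU -> GPU0 -> GPU1 round trip preserves values,
    False if the copy is corrupted, or None if fewer than two CUDA
    devices are visible."""
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return None
    # 1) A random tensor created directly on GPU1 should be non-zero.
    d = torch.rand(n, n, device='cuda:1')
    assert d.abs().sum().item() > 0
    # 2) Copy a large CPU tensor through GPU0 to GPU1 and compare.
    a = torch.rand(n, n)
    c = a.to('cuda:0').to('cuda:1')
    torch.cuda.synchronize(1)
    return torch.equal(a, c.cpu())

print(check_p2p_copy())
```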

Are you able to initialize a random tensor on GPU1 directly?

I’m able to initialize a random tensor on GPU1:

>>> d = torch.rand(1000,1000, device='cuda:1')
>>> d
tensor([[0.7564, 0.9157, 0.0234,  ..., 0.7413, 0.5786, 0.8648],
        [0.9184, 0.4508, 0.7758,  ..., 0.8677, 0.3734, 0.0650],
        [0.8781, 0.6679, 0.6536,  ..., 0.7493, 0.5669, 0.2126],
        ...,
        [0.6117, 0.8818, 0.2546,  ..., 0.2776, 0.6110, 0.9790],
        [0.8194, 0.1569, 0.4981,  ..., 0.1380, 0.8166, 0.5869],
        [0.6612, 0.3502, 0.1820,  ..., 0.4930, 0.5999, 0.5068]],
       device='cuda:1')

If you are using a larger tensor (let’s say 1 million values), is c still all zeros or does it contain random values (including Inf, NaN)?

It is still all zeros:

>>> import torch
>>> a=torch.rand(1000,1000)
>>> b = a.to('cuda:0')
>>> b
tensor([[0.1744, 0.7191, 0.2494,  ..., 0.5766, 0.0246, 0.1605],
        [0.3086, 0.8365, 0.2884,  ..., 0.7067, 0.3236, 0.4698],
        [0.0198, 0.3654, 0.4967,  ..., 0.7522, 0.4083, 0.5647],
        ...,
        [0.9669, 0.7398, 0.4953,  ..., 0.8938, 0.7569, 0.9778],
        [0.9604, 0.5742, 0.6091,  ..., 0.9456, 0.5841, 0.5666],
        [0.8006, 0.1631, 0.5027,  ..., 0.2221, 0.5726, 0.2811]],
       device='cuda:0')
>>> c = b.to('cuda:1')
>>> c
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:1')

I’m also struggling to reproduce this issue and haven’t seen it yet.

How can I help you to reproduce it?

I’m using Ubuntu server 18 and Python 3.6… I don’t think this influences it, but…
I’m not well versed in these things, but if I’m not wrong this feature uses NVIDIA GPUDirect, right? Is it possible to run a test outside PyTorch to check whether this feature works in my setup?

Did you install the CUDA samples when you installed CUDA?
If so, could you try running 0_Simple/simpleP2P (you can get the samples from the linked repo)?
It performs a similar kind of transfer and checks that the values are copied properly.
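As a quick complement from inside PyTorch, you can also ask the driver whether peer-to-peer access is reported between each pair of devices (a sketch; `p2p_matrix` is just a name I picked, and `nvidia-smi topo -m` shows the PCIe/NVLink topology from the command line):

```python
import torch

def p2p_matrix():
    """Map each ordered (src, dst) device pair to whether the driver
    reports direct peer-to-peer access between them."""
    n = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return {(i, j): torch.cuda.can_device_access_peer(i, j)
            for i in range(n) for j in range(n) if i != j}

print(p2p_matrix())
```

Note that even without direct peer access, peer copies are supposed to fall back to staging through host memory, so `False` here would not by itself explain the zeros; but `True` combined with failing copies would point at the P2P path.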

I did, and it seems the problem is not related to PyTorch, because this test also fails:

Enabling peer access between GPU0 and GPU1…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.07GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!

I suppose that means there is a problem in my hardware setup.

Have you already tried wiping your CUDA + NVIDIA driver install and redoing it from scratch? Maybe try that on a bootable drive so you don’t mess with your current install.
If that does not help, maybe remove the GPUs and put them back in the machine? They might not be properly seated physically.

Thanks, it is still not working :frowning: I suppose there is a problem with my hardware. Thanks for your help.