Are you able to initialize a random tensor on GPU1 directly?
I’m able to initialize a random tensor on GPU1:
>>> d = torch.rand(1000,1000, device='cuda:1')
>>> d
tensor([[0.7564, 0.9157, 0.0234,  ..., 0.7413, 0.5786, 0.8648],
        [0.9184, 0.4508, 0.7758,  ..., 0.8677, 0.3734, 0.0650],
        [0.8781, 0.6679, 0.6536,  ..., 0.7493, 0.5669, 0.2126],
        ...,
        [0.6117, 0.8818, 0.2546,  ..., 0.2776, 0.6110, 0.9790],
        [0.8194, 0.1569, 0.4981,  ..., 0.1380, 0.8166, 0.5869],
        [0.6612, 0.3502, 0.1820,  ..., 0.4930, 0.5999, 0.5068]],
       device='cuda:1')
If you are using a larger tensor (let’s say 1 million values), is c still all zeros or does it contain random values (including Inf, NaN)?
It is still all zeros:
>>> import torch
>>> a=torch.rand(1000,1000)
>>> b = a.to('cuda:0')
>>> b
tensor([[0.1744, 0.7191, 0.2494,  ..., 0.5766, 0.0246, 0.1605],
        [0.3086, 0.8365, 0.2884,  ..., 0.7067, 0.3236, 0.4698],
        [0.0198, 0.3654, 0.4967,  ..., 0.7522, 0.4083, 0.5647],
        ...,
        [0.9669, 0.7398, 0.4953,  ..., 0.8938, 0.7569, 0.9778],
        [0.9604, 0.5742, 0.6091,  ..., 0.9456, 0.5841, 0.5666],
        [0.8006, 0.1631, 0.5027,  ..., 0.2221, 0.5726, 0.2811]],
       device='cuda:0')
>>> c = b.to('cuda:1')
>>> c
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:1')
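One way to narrow this down is to compare the direct GPU-to-GPU copy with the same copy staged through host memory. Below is a minimal sketch, assuming two visible GPUs and a PyTorch version that provides torch.cuda.can_device_access_peer. The function name compare_copy_paths and the guarded import are made up for this example, not something from the thread.

```python
try:
    import torch
except ImportError:  # lets the script report cleanly when PyTorch is missing
    torch = None


def compare_copy_paths(size=1000, src_index=0, dst_index=1):
    """Compare a direct GPU-to-GPU copy with one staged through the CPU.

    Returns a dict of diagnostics, or None when the check cannot run
    (no PyTorch, no CUDA, or fewer GPUs than the indices require).
    """
    if torch is None or not torch.cuda.is_available():
        return None
    if torch.cuda.device_count() <= max(src_index, dst_index):
        return None

    src = torch.device("cuda", src_index)
    dst = torch.device("cuda", dst_index)

    a = torch.rand(size, size)      # reference data on the CPU
    b = a.to(src)                   # CPU -> first GPU
    direct = b.to(dst)              # GPU -> GPU, the copy that came back as zeros
    staged = b.cpu().to(dst)        # GPU -> CPU -> GPU, avoids the peer path

    return {
        # Whether the driver reports peer access between the two devices.
        "peer_access": torch.cuda.can_device_access_peer(src_index, dst_index),
        # Copies are bitwise, so exact equality is the right comparison.
        "direct_matches": torch.equal(direct.cpu(), a),
        "staged_matches": torch.equal(staged.cpu(), a),
    }


if __name__ == "__main__":
    print(compare_copy_paths())
```

If staged_matches comes back True while direct_matches is False, that would point at the peer-to-peer path rather than at the tensor data or at GPU1 itself.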
I’m also struggling to reproduce this issue and haven’t seen it yet.
How can I help you reproduce it?
I’m using Ubuntu Server 18 and Python 3.6. I don’t think this matters, but…
I’m not well versed in these things, but if I’m not wrong, this feature uses NVIDIA GPUDirect, right? Is it possible to run a test outside PyTorch to check whether this feature is working in my setup?
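For a check that does not involve PyTorch at all, the driver's own view of the GPU topology is a reasonable starting point. Below is a minimal sketch that wraps nvidia-smi topo -m, assuming nvidia-smi is on the PATH. The helper name gpu_topology is made up for this example, and the subprocess arguments are chosen so it also runs on Python 3.6.

```python
import shutil
import subprocess


def gpu_topology():
    """Return the text printed by `nvidia-smi topo -m`, or None on failure."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # driver tools not installed or not on PATH
    proc = subprocess.run(
        [exe, "topo", "-m"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return proc.stdout if proc.returncode == 0 else None


if __name__ == "__main__":
    print(gpu_topology() or "nvidia-smi not found or it returned an error")
```

The matrix it prints shows how each pair of GPUs is connected. To exercise the actual peer-to-peer copies outside PyTorch, the simpleP2P and p2pBandwidthLatencyTest programs from NVIDIA's CUDA samples are probably the closest match, although they have to be built locally.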