Unstable tensor creation when selecting GPU

Hi all,

I am new to PyTorch. When I select a GPU via os.environ["CUDA_VISIBLE_DEVICES"], I get a weird and very annoying error when running this code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import torch
arr = [[2, 573, 2119, 1, 1, 0, 1441, 1, 2119, 2, 0, 0, 1, 0, 0]]
b = torch.tensor(arr, dtype=torch.int64, device='cuda')
print(b)
assert b.sum().item() > 0

b should be tensor([[ 2, 573, 2119, 1, 1, 0, 1441, 1, 2119, 2, 0, 0, 1, 0, 0]], device='cuda:0'). But somehow, sometimes b becomes tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0').

My CUDA version: Cuda compilation tools, release 8.0, V8.0.44
These are my GPU details:
[image: GPU details]

Thank you so much for your help!!

That sounds weird. Are you seeing this effect only on GPU 1?

Yes, it works normally on GPU 0.

It looks like it's always using "cuda:0" nonetheless.

b should be tensor([[ 2, 573, 2119, 1, 1, 0, 1441, 1, 2119, 2, 0, 0, 1, 0, 0]], device='cuda:0'). But somehow, sometimes b becomes tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0').

Can you try it without that line (`os.environ["CUDA_VISIBLE_DEVICES"] = "1"`) and see if it works normally then? I have actually only used that with TensorFlow, and I am not sure whether it plays well with PyTorch.

You can do either

b = torch.tensor(arr, dtype=torch.int64, device='cuda:1')

or

b = torch.tensor(arr, dtype=torch.int64, device='cuda:0')

depending on which GPU you want to use.

If I don't choose a GPU, everything works fine. The thing is, I have to share the GPUs with others, so I have to choose which GPU to run on.

This makes the code hang without printing anything. But device='cuda:0' works normally.

If I don't choose a GPU, everything works fine. The thing is, I have to share the GPUs with others, so I have to choose which GPU to run on.

Sure, but I was suggesting to do that via "device", not "os.environ". It's a bit weird, because os.environ["CUDA_VISIBLE_DEVICES"] = "1" is supposed to make GPU 1 the default device, but via "cuda:0" you are using GPU 0. I would only use "device" and not mix it with

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

because I suspect that if you set os.environ["CUDA_VISIBLE_DEVICES"] = "1", then "cuda:0" will effectively be the first card that is visible, i.e., GPU 1, which can easily become confusing and then lead to issues.
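To illustrate that remapping: CUDA_VISIBLE_DEVICES lists the physical GPUs that CUDA exposes, and the runtime renumbers the visible ones from 0 in the listed order. Here is a pure-Python sketch of that mapping (visible_to_physical is a hypothetical helper for illustration, not part of PyTorch or CUDA):

```python
# Hypothetical helper showing how CUDA_VISIBLE_DEVICES remaps device indices.
# The CUDA runtime numbers the *visible* devices from 0, in the order listed,
# so the logical "cuda:N" indices PyTorch sees differ from the physical ids.
def visible_to_physical(cuda_visible_devices):
    """Map logical cuda:N indices to physical GPU ids."""
    return {logical: int(phys)
            for logical, phys in enumerate(cuda_visible_devices.split(","))}

# With CUDA_VISIBLE_DEVICES="1", logical cuda:0 is physical GPU 1,
# and there is no cuda:1 at all.
print(visible_to_physical("1"))    # {0: 1}
print(visible_to_physical("1,0"))  # {0: 1, 1: 0}
```

So after setting CUDA_VISIBLE_DEVICES="1", asking for "cuda:1" refers to a device that does not exist, which fits the hang you are seeing.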

I suggest setting

DEVICE = torch.device("cuda:1")

in your script and then use DEVICE everywhere, like

b = torch.tensor(arr, dtype=torch.int64, device=DEVICE)

and don’t use this

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

because it’s not necessary then.

I changed the device to torch.device("cuda:1"), but I get the same problem: the program gets stuck creating tensor b.

I changed the device to torch.device("cuda:1"), but I get the same problem: the program gets stuck creating tensor b.

Just to be clear: in this case, you did not use os.environ["CUDA_VISIBLE_DEVICES"] = "1"?

(Because I think if you set that, physical GPU 1 becomes torch.device("cuda:0"), and there is no torch.device("cuda:1") anymore.)

Yes, I didn't use os.environ. It can't print(b) when using "cuda:1".