Unstable tensor creation when selecting GPU

Hi all,

I am new to PyTorch. When I select a GPU via os.environ["CUDA_VISIBLE_DEVICES"], I get a weird and very annoying error when running this code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import torch
arr = [[2, 573, 2119, 1, 1, 0, 1441, 1, 2119, 2, 0, 0, 1, 0, 0]]
b = torch.tensor(arr, dtype=torch.int64, device='cuda')
print(b)
assert b.sum().item() > 0

b should be tensor([[ 2, 573, 2119, 1, 1, 0, 1441, 1, 2119, 2, 0, 0, 1, 0, 0]], device='cuda:0'). But somehow, sometimes b becomes tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0').

My CUDA version: Cuda compilation tools, release 8.0, V8.0.44
These are my GPU details:
[image: GPU details]

Thank you so much for your help!!

That sounds weird. Are you seeing this effect only on GPU 1?

Yes, it works normally on GPU 0.

It looks like it's always using "cuda:0" nonetheless.

b should be tensor([[ 2, 573, 2119, 1, 1, 0, 1441, 1, 2119, 2, 0, 0, 1, 0, 0]], device='cuda:0'). But somehow, sometimes b becomes tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0').

Can you try it without that line (`os.environ["CUDA_VISIBLE_DEVICES"] = "1"`) and see if it works normally then? I have actually only used that with TensorFlow, and I am not sure whether it plays well with PyTorch.

You can do either

b = torch.tensor(arr, dtype=torch.int64, device='cuda:1')

or

b = torch.tensor(arr, dtype=torch.int64, device='cuda:0')

depending on which GPU you want to use.

If I don't choose a GPU, everything works fine. The thing is, I have to share the GPUs with others, so I have to choose which GPU to run on.

This makes the code hang without printing anything. But device='cuda:0' works normally.

If I don't choose a GPU, everything works fine. The thing is, I have to share the GPUs with others, so I have to choose which GPU to run on.

Sure, but I was suggesting to do that via "device", not "os.environ". It's a bit weird, because os.environ["CUDA_VISIBLE_DEVICES"] = "1" is supposed to make GPU 1 the default device, but via "cuda:0" you are using GPU 0. I would only use "device" and not mix it with

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

because I suspect that if you set os.environ["CUDA_VISIBLE_DEVICES"] = "1", then "cuda:0" will effectively be the first card that is visible, i.e., GPU 1, which can easily become confusing and then lead to issues.
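To illustrate that remapping: CUDA_VISIBLE_DEVICES lists the physical GPUs that CUDA exposes, and the runtime renumbers the visible ones from 0 in the listed order. Here is a pure-Python sketch of that mapping (visible_to_physical is a hypothetical helper for illustration, not part of PyTorch or CUDA):

```python
# Hypothetical helper showing how CUDA_VISIBLE_DEVICES remaps device indices.
# The CUDA runtime numbers the *visible* devices from 0, in the order listed,
# so the logical "cuda:N" indices PyTorch sees differ from the physical ids.
def visible_to_physical(cuda_visible_devices):
    """Map logical cuda:N indices to physical GPU ids."""
    return {logical: int(phys)
            for logical, phys in enumerate(cuda_visible_devices.split(","))}

# With CUDA_VISIBLE_DEVICES="1", logical cuda:0 is physical GPU 1,
# and there is no cuda:1 at all.
print(visible_to_physical("1"))    # {0: 1}
print(visible_to_physical("1,0"))  # {0: 1, 1: 0}
```

So after setting CUDA_VISIBLE_DEVICES="1", asking for "cuda:1" refers to a device that does not exist, which fits the hang you are seeing.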

I suggest setting

DEVICE = torch.device("cuda:1")

in your script and then use DEVICE everywhere, like

b = torch.tensor(arr, dtype=torch.int64, device=DEVICE)

and don’t use this

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

because it’s not necessary then.

I changed the device to torch.device("cuda:1"), but I get the same problem: the program gets stuck creating tensor b.

I changed the device to torch.device("cuda:1"), but I get the same problem: the program gets stuck creating tensor b.

Just to be clear: in this case, you did not use os.environ["CUDA_VISIBLE_DEVICES"] = "1"?

(Because I think if you set that, physical GPU 1 becomes torch.device("cuda:0"), and there is no torch.device("cuda:1") anymore.)

Yes, I didn't use os.environ. It can't print(b) when using "cuda:1".