Only the first GPU is allocated (even though I make other GPUs visible)

The problem is that even though I specified which GPUs should be visible with os.environ["CUDA_VISIBLE_DEVICES"], the program keeps using only the first GPU.

(But other programs work fine and the specified GPUs are allocated correctly, so I don't think it is an NVIDIA or system problem.
nvidia-smi shows all GPUs and reports no issues.
I didn't have problems allocating GPUs with the code below before (except when the system was down),
and it works fine when I use

torch.device("cuda:other_gpu_id" if torch.cuda.is_available() else "cpu")

)

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

I put those lines before the call to the main function,
and they work fine for other programs on the same system (PyTorch 1.8.1 with CUDA 11.1).

I printed the args.gpu variable and could see that its value is not "0".

I also tried the following:

CUDA_VISIBLE_DEVICES=4,5 python script.py

but still only the first GPU is allocated.

When I asked this question on Stack Overflow,
the answer I got was that those os.environ calls work for TensorFlow- and Keras-based code but not for PyTorch code.

Since I didn't have problems running PyTorch code with this GPU-selecting code before,
I want to know what makes it work in some cases and not in others.

I’m unsure what the actual issue is. Are you trying to mask specific GPUs and this is not working correctly using the os.environ call? If so, set CUDA_VISIBLE_DEVICES in the terminal (either as part of the actual command or export it).
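Both variants look like this (a sketch assuming GPUs 4 and 5 and a script named script.py, as in the question; printenv is used here in place of python script.py so the effect can be checked without a GPU):

```shell
# inline, for a single run (the variable applies only to this one command):
CUDA_VISIBLE_DEVICES=4,5 printenv CUDA_VISIBLE_DEVICES   # replace printenv ... with: python script.py

# or exported, for every subsequent command in the shell session:
export CUDA_VISIBLE_DEVICES=4,5
printenv CUDA_VISIBLE_DEVICES                            # replace with: python script.py
```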

Or are you making all GPUs visible and cannot use any other device?
In that case, what does x = torch.randn(1, device='cuda:1') return and what does nvidia-smi show?
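Note that after masking, the remaining devices are renumbered from zero inside the process: with CUDA_VISIBLE_DEVICES=4,5, cuda:0 is physical GPU 4 and cuda:1 is physical GPU 5. A small sketch of that mapping (visible_to_physical is a hypothetical helper for illustration, not a PyTorch API):

```python
def visible_to_physical(env_value):
    """Map in-process device ids to physical GPU ids after masking.

    Illustration only: with CUDA_VISIBLE_DEVICES=4,5 the process sees
    two devices, renumbered as cuda:0 (physical 4) and cuda:1 (physical 5).
    """
    physical = [int(g) for g in env_value.split(",")]
    return {f"cuda:{i}": p for i, p in enumerate(physical)}

print(visible_to_physical("4,5"))  # {'cuda:0': 4, 'cuda:1': 5}
```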


Thank you for answering the question.

Yes, the actual issue is the first one: masking specific GPUs is not working correctly via the os.environ call.

The program works well in the second case: if I specify the available GPUs with torch.device(), it works, and I can see in nvidia-smi that they are allocated correctly.

I am curious why the os.environ call sometimes does not work in a PyTorch program.

I am sorry for changing my wording.
When I tried
CUDA_VISIBLE_DEVICES=4 python script.py
again, I can see that the specified GPU is allocated correctly…
(I previously saw that setting CUDA_VISIBLE_DEVICES in the terminal didn't work, and I couldn't find the answer here, which is why I made a new question.
But now maybe I need to change the question or delete it.)

Can I ask why os.environ does not work in some cases?
I put those lines at the top of the code (after the imports), before the main function, and until now that worked well:
after I declared the visible devices with os.environ,
[object].cuda() was allocated on the visible devices.

But for this program, [object].cuda() keeps assigning everything to only the first GPU.

If I understand you correctly, setting CUDA_VISIBLE_DEVICES in the terminal works correctly, but it doesn't work if you try to set it inside the script via os.environ.
This is a common issue when the env var is not set before importing any CUDA-related libraries, which is why I recommend setting it in the terminal.
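To illustrate the timing issue without needing a GPU: imagine a library that snapshots CUDA_VISIBLE_DEVICES once when it initializes and never re-reads it, which is the behavior your process effectively gets (FakeCudaRuntime is a stand-in for illustration, not a real API):

```python
import os

class FakeCudaRuntime:
    """Stand-in for the CUDA runtime: reads CUDA_VISIBLE_DEVICES
    once at initialization and never looks at it again."""
    def __init__(self):
        self.visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<all>")

os.environ["CUDA_VISIBLE_DEVICES"] = "4"
runtime = FakeCudaRuntime()               # the "import torch" moment

os.environ["CUDA_VISIBLE_DEVICES"] = "5"  # too late: already initialized
print(runtime.visible)  # 4
```

This is why setting the variable in the terminal always works: it is in the environment before the Python process even starts.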
