CUDA_VISIBLE_DEVICE is of no use

I have a 4-Titan-XP GPU server. When I use os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" to allocate GPUs for a task in Python, I find that only GPU 0 is used, and there are out-of-memory problems even though GPU 1 is free.
Should I allocate memory to different GPUs myself?

3 Likes

Use CUDA_VISIBLE_DEVICES (not “DEVICE”). You have to set it before you launch the program – you can’t do it from within the program.

4 Likes

My bad, there is a typo in my post. But in my code, when I use
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
only GPU 1 is used. At least that line has an effect: it does control which GPUs can be used.
However, it is supposed to make GPUs 1 and 2 available for the task, yet only GPU 1 is actually used. Even when GPU 1 runs out of memory, GPU 2 is not touched. Is there some other switch that controls parallel computing across two GPUs?
BTW, another question: does PyTorch tend to fill GPUs one by one, or does it allocate memory evenly across them?

You can push your data and models to a specific GPU using .cuda(gpu_id). E.g. you can load a generator network on one GPU and the discriminator on the other.
Another option is to use the DataParallel module.
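
For example, something along these lines (a minimal sketch, assuming at least two visible GPUs; the networks and shapes here are just placeholders):

import torch
import torch.nn as nn

generator = nn.Linear(100, 784).cuda(0)    # generator lives on GPU 0
discriminator = nn.Linear(784, 1).cuda(1)  # discriminator lives on GPU 1

noise = torch.randn(16, 100).cuda(0)
fake = generator(noise)                    # computed on GPU 0
score = discriminator(fake.cuda(1))        # move activations to GPU 1 before the discriminator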

3 Likes

Do you mean that if I want to use two GPUs at the same time, I have to change my source code and add parallelism-related code?
Or does it just use one GPU at most if nothing is specified?

Basically it is just one line to use DataParallel:

net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])  # replicate the model onto GPUs 0, 1 and 2
output = net(input_var)                                   # the input batch is split along dim 0 across those GPUs

Just wrap your model in DataParallel and call the returned net on your data.
The device_ids argument specifies which GPUs will be used.
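
For reference, a minimal self-contained sketch (assuming at least two visible GPUs; the model and tensor shapes are made up for illustration). Note that the model's parameters and the input should live on device_ids[0], and the outputs are gathered back there as well:

import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda(0)               # parameters on device_ids[0]
net = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(8, 10).cuda(0)                 # batch of 8, split 4/4 across the two GPUs
out = net(x)                                   # results gathered back onto device_ids[0]
print(out.shape)                               # torch.Size([8, 2])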

7 Likes

It doesn’t work for me. Still, only the first GPU is used. Are there any more tricks?

2 Likes

Does DataParallel only work with a batch size greater than 1? Actually, I only use batch=1.

You need a batch size >= the number of GPUs for data parallelism to apply. As its name suggests, DataParallel just pushes the computation for different samples in a batch to different GPUs.
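
A quick way to see the split (a toy sketch, assuming two visible GPUs): each replica only receives its chunk of the batch, so with batch=1 the second GPU never gets any work.

import torch
import torch.nn as nn

class EchoDevice(nn.Module):
    def forward(self, x):
        print(x.device, x.shape)   # each replica prints the slice it received
        return x

net = nn.DataParallel(EchoDevice().cuda(), device_ids=[0, 1])
net(torch.randn(1, 4).cuda())      # batch=1: only cuda:0 prints, GPU 1 stays idle
net(torch.randn(4, 4).cuda())      # batch=4: both GPUs print a chunk of 2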

2 Likes

Hi! Adding os.environ['CUDA_VISIBLE_DEVICES'] = "2" in my code does not work; the code always selects the first GPU. However, CUDA_VISIBLE_DEVICES=2 python train.py works.
I have seen setting os.environ['CUDA_VISIBLE_DEVICES'] in code work elsewhere, though. Do you know why?

@MrTuo This is how the PyTorch 0.4.1 convention works: if you set CUDA_VISIBLE_DEVICES=2,3, then for PyTorch physical GPU 2 is cuda:0 and physical GPU 3 is cuda:1. Just check whether your code is consistent with this convention.
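
In other words (a small sketch, assuming the process was launched with CUDA_VISIBLE_DEVICES=2,3):

import torch

x = torch.randn(4).to('cuda:0')   # lands on physical GPU 2
y = torch.randn(4).to('cuda:1')   # lands on physical GPU 3
# 'cuda:2' would fail here, since only two devices are visible to this process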

3 Likes

I had this same issue where setting CUDA_VISIBLE_DEVICES=2 python train.py worked but setting os.environ['CUDA_VISIBLE_DEVICES'] = "2" didn’t. The cause of the issue for me was importing the torch packages before setting os.environ['CUDA_VISIBLE_DEVICES']; moving the assignment to the top of the file, before importing torch, solved it. Hope this helps.
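
A minimal sketch of that ordering (assuming the machine actually has a GPU with physical index 2):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # must run before torch is imported anywhere

import torch
print(torch.cuda.device_count())           # 1 -- only physical GPU 2 is visible
print(torch.cuda.current_device())         # 0 -- that GPU is now cuda:0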

25 Likes

That’s helpful for me, thanks 3000 times

thank you, it works.

Hey, I have the opposite problem: my code is using both of my GPUs by default, no matter what I do. They are different GPU models and I do NOT want to use them for parallel processing. I’ve tried selecting GPU #0 with CUDA_VISIBLE_DEVICES, tried setting it with torch, and moved it to the beginning of the code; nothing is working.

Just for the record, I am doing deep-learning object detection, importing arcgis and torch. Everything else seems to work fine now, until I try to test the learning rate and it tells me my GPUs are imbalanced and that I should exclude GPU #1. I never wanted GPU #1 to be utilized in the first place.

EDIT: Never mind, it appears to be working now after I moved it towards the beginning of the code. I guess I reset the kernel somehow, which made it work. I’m just a rookie :stuck_out_tongue:

When I tried this solution (I have two GPUs), it shows an error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

I’m not sure which solution you are referring to, but the error could be raised if you manually specify a device inside the model.
Could you post an executable code snippet that reproduces the issue, so that we could debug it, please?
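
For example, a (hypothetical) pattern like this would trigger that error under DataParallel, because the device hard-coded inside forward conflicts with the device of the replica's parameters:

import torch
import torch.nn as nn

class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        x = x.to('cuda:1')        # device hard-coded inside the model
        return self.fc(x)         # this replica's weights are on cuda:0 -> mismatch

net = nn.DataParallel(BadModel().cuda(), device_ids=[0, 1])
net(torch.randn(4, 10).cuda())    # RuntimeError: tensors found on cuda:0 and cuda:1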

Saved 30 hours of my life. Thanks a ton.

It’s also possible to run into this with bad conda environments.

For me,

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2" # just use one GPU on big machine
import torch
assert torch.cuda.device_count() == 1

failed, but that was because my environment was problematic, and only

import torch 
print(torch.cuda.current_device())

actually raised an error.