Why is my GPU device specification behaving strangely?

I have 4 GPUs. GPUs 0, 1, and 2 are in use, but GPU 3 is idle, so I want to use it (nvidia-smi screenshot omitted).

Then in my script, I have:
device = torch.device("cuda:3" if torch.cuda.is_available() else "cpu")

However, this gives me a weird error message:
RuntimeError: CUDA error: invalid device ordinal

When I try cuda:0, 1, or 2 instead, it correctly uses the GPU but runs out of memory, which makes sense because other programs are using those devices. What I don't understand is why cuda:3 produces that error.

Then I changed it to:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

And at the command line I use:
CUDA_VISIBLE_DEVICES=3 python my_script.py

This works fine, but it doesn't allow me to use nohup:
nohup CUDA_VISIBLE_DEVICES=3 python my_script.py &
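
(Presumably this fails because nohup treats CUDA_VISIBLE_DEVICES=3 as the command to run; putting the assignment first, as in CUDA_VISIBLE_DEVICES=3 nohup python my_script.py &, should work.)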

Anyway, why can't I specify cuda:3 in the script without using the command-line option?

Are you sure you don't have a CUDA_VISIBLE_DEVICES=0,1,2 set by default?


Oh, yes, it is in my bash file.

export CUDA_VISIBLE_DEVICES=0,1,2

Does this overwrite the device = torch.device("cuda:3") setting in the script?

Well, it hides the 4th device. So you cannot use it.
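
If you want to select the GPU from inside the script instead, one option is to set the variable in Python before torch initializes CUDA. A minimal sketch, assuming it runs before anything touches the GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # must be set before CUDA is initialized

import torch

# only physical GPU 3 is visible now, and it appears as cuda:0 in this process
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")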


A further question on this:

I want to use multiple GPUs, so I do:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def get_bert():
    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext", model_max_length=128)
    model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = torch.nn.DataParallel(model)  # replicates the model across the visible GPUs
        #model = torch.nn.DataParallel(model, device_ids=[0, 1])
    model = model.to(device)

    return tokenizer, model

From the nvidia-smi screenshot, you can see that there are 4 GPUs; however, torch.cuda.device_count() returns 3, and my program only uses 3 GPUs (0, 1, 2) while GPU 3 stays idle. If I specify:
device_ids=[0, 1, 2, 3]
it raises AssertionError: Invalid device id.

Another question: when using multiple GPUs, is this specification of device right?
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

This seems to only give me cuda:0, but the code above does use 3 GPUs, which I don't quite understand.

From the nvidia-smi screenshot, you can see that there are 4 GPUs; however, torch.cuda.device_count() returns 3.

If you still have the CUDA_VISIBLE_DEVICES setting that hides the 4th GPU, that is expected. Keep in mind that nvidia-smi does NOT respect this environment variable and will always show you all the devices.
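
A quick sanity check of what the process actually sees (a minimal sketch):

import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # e.g. "0,1,2" if the export is still active
print(torch.cuda.device_count())               # counts only the visible devices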

This seems to only give me cuda:0, but the code above does use 3 GPUs, which I don't quite understand.

It does give you device 0 by default.
3 GPUs are used because of your DataParallel call: since you don't specify device_ids, it uses all visible GPUs.
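
Note that device_ids are indices into the visible devices, not the physical ones, so [0, 1, 2, 3] only becomes valid once all 4 GPUs are visible. A minimal sketch, with a stand-in module in place of your BertModel:

import torch
import torch.nn as nn

model = nn.Linear(16, 16)  # stand-in for the real model
if torch.cuda.device_count() > 1:
    # list every visible device explicitly; equivalent to omitting device_ids
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model = model.to("cuda:0")  # parameters must live on the first listed device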

My fault:

'If you still have the CUDA_VISIBLE_DEVICES setting that hides the 4th GPU, that is expected.': I thought I had disabled the setting in my bash file, but it's still on.