Then in my script, I have: device = torch.device("cuda:3" if torch.cuda.is_available() else "cpu")
However, this gives me a weird error message: RuntimeError: CUDA error: invalid device ordinal
When I try cuda:0, cuda:1, or cuda:2 instead, the script does use the GPU but fails with an out-of-memory error. That makes sense, because other programs are using those GPUs. What I don’t understand is why cuda:3 produces the invalid device ordinal error.
Then I changed to:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
And at the command line I use: CUDA_VISIBLE_DEVICES=3 python my_script.py
This works fine, but this doesn’t allow me to use nohup: nohup CUDA_VISIBLE_DEVICES=3 python my_script.py &
Anyway, why can’t I specify cuda:3 in the script without using the command-line option?
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def get_bert():
    from transformers import BertTokenizer, BertModel
    tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext", model_max_length=128)
    model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        # No device_ids given, so DataParallel uses every visible GPU
        model = torch.nn.DataParallel(model)
        #model = torch.nn.DataParallel(model, device_ids=[0, 1])
    model = model.to(device)
    return tokenizer, model
From the nvidia-smi screenshot, you can see that there are 4 GPUs. However, torch.cuda.device_count() returns 3, and my program only uses 3 GPUs (0, 1, 2) while GPU 3 sits idle. If I specify device_ids=[0,1,2,3], it fails with ‘AssertionError: Invalid device id’.
Another question: when using multiple GPUs, is this device specification right? device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
This seems to only get ‘cuda:0’, but the code above does use 3 GPUs, which I don’t quite understand.
‘From nvidia-smi screenshot, you can see that there are 4 GPUs, however, the torch.cuda.device_count()=3’:
If CUDA_VISIBLE_DEVICES is still set to a value that hides the 4th GPU, that is expected. Keep in mind that nvidia-smi does NOT respect this environment variable and will always show you all the devices.
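To make the renumbering concrete, here is a minimal pure-Python sketch of how the visible list maps logical ordinals (what `cuda:N` refers to inside the process) onto physical GPUs (what nvidia-smi shows). The helper name `logical_to_physical` is hypothetical, the value "0,1,2" is only an assumption about what the shell had set, and the model is simplified: integer indices only, and an invalid entry hides itself and everything after it, as I understand the CUDA documentation.

```python
def logical_to_physical(cuda_visible_devices, physical_count):
    """Return the physical GPU index behind each logical ordinal.

    Hypothetical helper for illustration; it does not query any driver.
    Simplified model: integer indices only (GPU UUIDs are not handled).
    """
    if cuda_visible_devices is None:
        # Variable unset: every physical GPU is visible, in order.
        return list(range(physical_count))
    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            # Assumed rule: an invalid entry hides itself and all later ones.
            break
        visible.append(int(token))
    return visible


# Assumed shell setting that hides the 4th GPU on a 4-GPU machine.
mapping = logical_to_physical("0,1,2", physical_count=4)
print(mapping)           # [0, 1, 2]
print(3 < len(mapping))  # False: "cuda:3" would be an invalid device ordinal

# With CUDA_VISIBLE_DEVICES=3, physical GPU 3 becomes logical device 0.
print(logical_to_physical("3", physical_count=4))  # [3]
```

Under that assumption, the process sees three devices numbered 0 to 2, so `cuda:3` does not exist for it, while `CUDA_VISIBLE_DEVICES=3` exposes the idle GPU as `cuda:0`.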
‘This seems to only get ‘cuda:0’, but the code above does use 3 GPUs, which I don’t quite understand.’:
It does get you device 0 by default.
3 GPUs are used because your DataParallel call doesn’t specify device_ids, so it uses all visible GPUs.
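As a rough sketch of that behaviour, the snippet below spells out the device list explicitly instead of relying on the default. The helper name `wrap_for_visible_gpus` and the tiny `Linear` model are placeholders, not part of the original script; the only assumption about PyTorch is that DataParallel keeps the model’s parameters on the first entry of device_ids.

```python
import torch


def wrap_for_visible_gpus(model):
    """Wrap `model` in DataParallel over every GPU this process can see.

    Hypothetical helper for illustration only.
    """
    # Counts only the devices left visible by CUDA_VISIBLE_DEVICES.
    n = torch.cuda.device_count()
    if n == 0:
        return model, torch.device("cpu")
    # Logical ids always run from 0 to n-1, whatever the physical GPUs are.
    device_ids = list(range(n))
    if n > 1:
        model = torch.nn.DataParallel(model, device_ids=device_ids)
    # The parameters live on the first listed device, i.e. cuda:0.
    device = torch.device(f"cuda:{device_ids[0]}")
    return model.to(device), device


# Placeholder model standing in for the BERT model from the question.
model, device = wrap_for_visible_gpus(torch.nn.Linear(4, 2))
out = model(torch.randn(8, 4).to(device))
print(out.shape)  # torch.Size([8, 2])
```

This is why moving the model to plain "cuda" (device 0) is enough: inputs go to device 0 and DataParallel scatters each batch across the listed devices on its own.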
‘If you still have the CUDA_VISIBLE_DEVICES that hides the 4th GPU that is expected.’: I thought I had disabled that setting in bash, but it was still on.