Every time I use DataParallel to run my code on multiple GPUs, I find it difficult to get working. Recently I came across this error:
all tensors must be on devices[0]
I have spent hours trying to make it work, but I couldn't. Please help me out.
What I did:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3, 6"
- I moved input_tensor and the model to cuda:0 with input_tensor.to("cuda:0") and model.to("cuda:0")
- I wrapped the model and ran the backward pass:
model = torch.nn.DataParallel(model)
loss.mean().backward()
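
To make this concrete, here is a minimal sketch of the whole setup that reproduces the error for me (the model and batch below are just placeholders standing in for my real code):

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3, 6"

import torch
import torch.nn as nn

# placeholder model and batch; my real model and data are larger
model = nn.Linear(10, 1)
input_tensor = torch.randn(32, 10)

# with CUDA_VISIBLE_DEVICES set, cuda:0 should be the first visible device,
# i.e. physical GPU 3
input_tensor = input_tensor.to("cuda:0")
model = model.to("cuda:0")

model = torch.nn.DataParallel(model)
loss = model(input_tensor)  # in my real code this is a per-sample loss
loss.mean().backward()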
I also tried:
- os.environ["CUDA_VISIBLE_DEVICES"] = "0, 6"
- input_tensor.to("cuda:6") and model.to("cuda:6")
…nothing worked.
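
For completeness, that second attempt is the same sketch with only the device settings changed; on my machine it errors out as well:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0, 6"

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
input_tensor = torch.randn(32, 10)

# here I tried to address the second GPU by its original physical index instead
input_tensor = input_tensor.to("cuda:6")
model = model.to("cuda:6")

model = torch.nn.DataParallel(model)
loss = model(input_tensor)
loss.mean().backward()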
What am I doing wrong?
Since cuda:0 on my server is always busy, I would like to run my code efficiently on other GPUs such as cuda:3 and cuda:6. Any help would be highly appreciated.