Training on Multiple GPUs

Every time I use DataParallel to run my code on multiple GPUs, I find it difficult to get it working. Recently I came across this error:

all tensor must be on devices[0]

I spent hours trying to make it work, but I couldn't. Please help me out.

What I did (the full setup is sketched in the code after this list):

  • os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
  • os.environ["CUDA_VISIBLE_DEVICES"] = "3, 6"
  • I moved all input tensors and the model to cuda:0 with input_tensor.to("cuda:0") and model.to("cuda:0")
  • model = torch.nn.DataParallel(model)
  • loss.mean().backward()
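
For reference, here is a minimal sketch of what my script roughly looks like. The linear model, tensor shapes, and MSE loss are placeholders just for illustration, not my real code; the environment variables and device calls are exactly the ones listed above.

    import os

    # select the physical GPUs before any CUDA call is made
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "3, 6"

    import torch
    import torch.nn as nn

    # placeholder model, data, and loss, only to show the structure of the script
    model = nn.Linear(10, 2).to("cuda:0")
    model = nn.DataParallel(model)

    input_tensor = torch.randn(8, 10).to("cuda:0")
    target = torch.randn(8, 2).to("cuda:0")

    output = model(input_tensor)
    loss = nn.functional.mse_loss(output, target, reduction="none")
    loss.mean().backward()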

I also tried (the changed lines are sketched after this list):

  • os.environ["CUDA_VISIBLE_DEVICES"] = "0, 6"
  • input_tensor.to("cuda:6") and model.to("cuda:6"), but nothing worked.
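
The variant looked roughly like this; these are only the lines I changed relative to the sketch above, with the rest of the script unchanged.

    # variant: expose GPUs 0 and 6, and move everything to cuda:6 instead
    os.environ["CUDA_VISIBLE_DEVICES"] = "0, 6"

    model = model.to("cuda:6")
    input_tensor = input_tensor.to("cuda:6")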

What am I doing wrong?

Since cuda:0 on my server is always busy, I would like to run my code efficiently on other GPUs such as cuda:3 and cuda:6. Any help is highly appreciated.