Bug in DataParallel? Only works if the dataset device is cuda:0

Oops, now I see the confusion. It means the user has to keep track of what they are passing as device_ids to DataParallel. It's tricky, and it should be documented somewhere.


Hi,

When we pass device_ids=[id1, id2, id3, id4] to DataParallel, id1 becomes the master device (the model's parameters must live there, and outputs are gathered there by default) and the remaining ids act as replicas. To make any other device the master, we can simply reorder the list, e.g. device_ids=[3, 1, 0, 2] makes device 3 (the fourth GPU) the master. This is documented in the PyTorch codebase, but it should definitely be clarified further to make it easier to follow.
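
For example, here is a minimal sketch of making device 3 the master (the model itself is just a placeholder):

import torch
import torch.nn as nn

# device_ids=[3, 1, 0, 2] makes cuda:3 the master device, so the model's
# parameters must already live on cuda:3; outputs are also gathered there.
model = nn.Linear(128, 10)
if torch.cuda.device_count() >= 4:
    torch.cuda.set_device(3)                     # cuda:3 becomes the current device
    model = model.cuda(3)                        # parameters on device_ids[0]
    model = nn.DataParallel(model, device_ids=[3, 1, 0, 2])
    out = model(torch.randn(64, 128).cuda(3))    # output lands on cuda:3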

More generally, I recommend the following code to manage this:

if torch.cuda.is_available() and len(args.deviceIds) > 0:
    # remove any device that doesn't exist
    args.deviceIds = [int(d) for d in args.deviceIds if 0 <= int(d) < torch.cuda.device_count()]
    # set args.deviceIds[0] (the master device) as the current device
    torch.cuda.set_device(args.deviceIds[0])
    args.device = torch.device("cuda")
else:
    args.device = torch.device("cpu")
# Retrieve the model and move it to the selected device
model = loadOrCreateModel(...)
model = model.to(args.device)
if torch.cuda.is_available() and len(args.deviceIds) > 1:
    model = torch.nn.DataParallel(model, device_ids=args.deviceIds)
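
For completeness, args.deviceIds above is assumed to come from something like the following argparse setup (the argument name and default are just an example):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--deviceIds', nargs='+', type=int, default=[0],
                    help='GPU indices to use; the first one becomes the master device')
args = parser.parse_args()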

Oh nice, didn’t know that. That’s indeed very useful.

Hey,

The workaround is simple.
The issue occurs when you’re loading the checkpoint (which you’ve already trained in DataParallel) to resume the training.
When executing the script to resume training, we use:

model = torch.load('checkpoint.pth').cuda()

But if you've already trained the model in DataParallel and saved the checkpoint, then when resuming you typically write:

model = torch.load('checkpoint.pth').cuda()
model = torch.nn.DataParallel(model)

And that's where PyTorch throws the error.

What you should do is:

  1. If you've already trained the model in DataParallel and want to resume training, simply use model = torch.load('checkpoint.pth').cuda() and comment out or delete the model = torch.nn.DataParallel(model) line (see the sketch after this list).
    It'll run for sure.
  2. Use os to control which devices your script can see, like this:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

and set the device with

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
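
Putting option 1 together, a rough sketch of resuming looks like this (assuming the checkpoint was saved with torch.save(model, 'checkpoint.pth') while the model was still wrapped in DataParallel; the optimizer is just a placeholder):

import torch

# The checkpoint already contains a DataParallel module, so we do NOT
# wrap it in torch.nn.DataParallel again here.
model = torch.load('checkpoint.pth').cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder optimizer
# ... continue the training loop exactly as before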

Hope it helps!

Hi, I'm not sure if this is the actual solution, but it worked for me.

Before:

    model = EfficientNet.from_pretrained('efficientnet-b0')
    model = model.cuda()
    criterion = nn.BCEWithLogitsLoss().cuda()
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = nn.DataParallel(model)

After:

    model = EfficientNet.from_pretrained('efficientnet-b0')
    
    criterion = nn.BCEWithLogitsLoss().cuda()
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = nn.DataParallel(model)
    model = model.cuda()  # move the model to the GPU only after wrapping it in DataParallel