Questions about torch.nn.DataParallel

Hello! I am running code to train a model on one machine with multiple GPUs using torch.nn.DataParallel. The code places the input data on GPU device 0 when it is loaded, like this:

model = torch.nn.DataParallel(model).cuda()
device = args.gpu if args.gpu is not None else 0
for i, sample in enumerate(loader):
    video, audio, index = sample['frames'], sample['audio'], sample['index']
    # move every input onto the source device (GPU 0 by default)
    video = video.cuda(device, non_blocking=True)
    audio = audio.cuda(device, non_blocking=True)
    index = index.cuda(device, non_blocking=True)

    output = model(video, audio, index)

For ease of explanation, I have omitted unimportant details. Now I am trying to add a text modality without specifying the GPU device. The code is written as follows:

model = torch.nn.DataParallel(model).cuda()
device = args.gpu if args.gpu is not None else 0
for i, sample in enumerate(loader):
    video, audio, text, index = sample['frames'], sample['audio'], sample['text'], sample['index']
    video = video.cuda(device, non_blocking=True)
    audio = audio.cuda(device, non_blocking=True)
    index = index.cuda(device, non_blocking=True)

    text = text.cuda()  # no explicit device given here

    output = model(video, audio, text, index)

Is this correct? Why does the data need to be on device 0 when using torch.nn.DataParallel? Thanks.

nn.DataParallel uses the default device (GPU 0, unless you have changed it) to store the data as well as the model, and then scatters both the data and the model replicas to each device during the forward pass. This is one reason DistributedDataParallel is preferred: it does not create imbalanced GPU memory usage and it avoids the per-iteration model replication. The detailed workflow is explained in e.g. this blog post.
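To illustrate, here is a minimal sketch of that behaviour using a hypothetical toy model (not your actual code), assuming a machine with CUDA GPUs available: the batch only needs to live on the source device (GPU 0 by default); nn.DataParallel splits it along dim 0, scatters the chunks to every visible GPU together with a model replica, and gathers the outputs back onto the source device.

import torch
import torch.nn as nn

# toy model purely for illustration; parameters live on GPU 0
model = nn.DataParallel(nn.Linear(16, 4)).cuda()

# the batch only needs to sit on the source device (GPU 0 by default)
x = torch.randn(32, 16).cuda(0, non_blocking=True)

# scatter inputs -> parallel forward on all visible GPUs -> gather outputs
out = model(x)
print(out.device)  # outputs are gathered back onto cuda:0

On a multi-GPU box the batch of 32 is split across all visible devices for the forward pass, but from the caller's point of view everything starts and ends on device 0, which is why the inputs are placed there.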

Thank you very much! It is really helpful to me!