DataParallel not splitting the data between the GPUs

Hi Everyone,

I am training a model on 4 GPUs that was previously trained on a single GPU, to take advantage of data parallelism and speed up training. I set the batch size to 8 and expected the data to be distributed evenly across the 4 GPUs, giving each GPU a batch of 2. However, all the inputs are always placed on GPU 0.

Code for Data Parallelism

    print("GPU Count ", torch.cuda.device_count())
    models["encoder"] = ResnetEncoder(
        num_layers=options.resnet_num_layers, pretrained=True)
    models["depth"] = DepthDecoder(
        models["encoder"].num_ch_enc, scales=range(args.numScales), use_bn=args.use_bn)

    if torch.cuda.device_count() > 1:
        models["encoder"] = torch.nn.DataParallel(models["encoder"])
        models["decoder"] = torch.nn.DataParallel(models["depth"])
    models["encoder"].cuda()
    models["depth"].cuda()

Data Loading Phase

    for epoch in range(options.numEpochs):
        data_iterator = tqdm(dataloader)

        optimizer.zero_grad()
        for sampleIndex, inputs in enumerate(data_iterator):
            for k in inputs:
                print (inputs[k].size())
            # predict depth maps, now only apply on the first frame
            outputs = {}
            for seq_index in range(options.numFrames):
                # images are supposed to be normalized to [0, 1] in dataloader
                inputs['image', seq_index] = inputs['image', seq_index].cuda()
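                # .cuda() is not in-place for tensors, so the result is assigned back
                inputs['extrinsic', seq_index] = inputs['extrinsic', seq_index].cuda()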
                print (inputs['image', seq_index].device)

inputs is a dictionary holding tensors of the following sizes:

torch.Size([8, 3, 480, 640])
torch.Size([8, 3, 480, 640])
torch.Size([8, 480, 640])
torch.Size([8, 4, 4])
torch.Size([8, 3, 480, 640])
torch.Size([8, 3, 480, 640])
torch.Size([8, 480, 640])
torch.Size([8, 4, 4])
torch.Size([8, 3, 480, 640])
torch.Size([8, 3, 480, 640])
torch.Size([8, 480, 640])
torch.Size([8, 4, 4])
torch.Size([8, 6])

The print statement print (inputs['image', seq_index].device) always prints cuda:0. I expected it to print cuda:1, cuda:2 or cuda:3 as well, but it looks like the data is not being split across the GPUs. Am I missing something here?

The data will be split in the forward method of the model (actually just before it is called), so you might want to check the shape and device of the data inside the model, not in the DataLoader loop.
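
A minimal sketch of that check, assuming a toy module (ShapeProbe is a hypothetical name and not part of the original code). The print inside forward shows what each replica receives, while the print in the loop shows the full batch. On a machine with fewer than two GPUs the wrapper is skipped and forward simply reports the full batch.

```python
import torch
import torch.nn as nn


class ShapeProbe(nn.Module):
    """Hypothetical toy module that reports what each replica receives."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)

    def forward(self, x):
        # Under nn.DataParallel this runs once per replica, on that replica's chunk.
        print("inside forward:", tuple(x.shape), x.device)
        return self.conv(x)


model = ShapeProbe()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

batch = torch.rand(8, 3, 32, 32)
if torch.cuda.is_available():
    batch = batch.cuda()

# In the training loop the batch is still whole and sits on the default device.
print("in the loop:", tuple(batch.shape), batch.device)
out = model(batch)
# The outputs of all replicas are gathered back into one full-size batch.
print("gathered output:", tuple(out.shape), out.device)
```

With 4 GPUs and a batch of 8, the print inside forward would be expected to appear once per replica with a leading dimension of 2.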

@ptrblck Thank you! That works exactly as expected now. I have a follow-up to my previous question. It concerns the part of my network shown in the code snippet below:

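            for seq_index in range(options.numFrames):
                # images are supposed to be normalized to [0, 1] in dataloader
                inputs['image', seq_index] = inputs['image', seq_index].cuda()
                # .cuda() is not in-place for tensors, so the result is assigned back
                inputs['extrinsic', seq_index] = inputs['extrinsic', seq_index].cuda()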
                if seq_index == 0:
                    # use color augmented image for training (not to use for loss computing)
                    features = models["encoder"](inputs['image_aug', seq_index].cuda())
                    print ("Type of feature ", len(features))
                    multi_scale_depth_pred = models["depth"](features)

The line features = models["encoder"](inputs['image_aug', seq_index].cuda()) returns a list of length 5, where each element is a 4-dimensional tensor (batch size x channels x H x W). This features list is then fed to another model, models["depth"]. At this point I have two questions:

  1. Since I again want to split features across my GPUs, do I need to perform any specific operation on it? features.cuda() is not valid, since features is a list.

  2. The features returned by models["encoder"] come back from multiple GPUs. How do I make sure that the features computed on a given GPU go to the same GPU in the step multi_scale_depth_pred = models["depth"](features)?

Thank you,
Nitin

  1. I’m not sure whether nn.DataParallel works with lists as inputs, and I would recommend checking it with a very simple model.

  2. nn.DataParallel will split the input batch in dim0 and send each chunk to the corresponding device. The result will be gathered on the default device again. Passing this tensor into the next nn.DataParallel module will perform the same splitting. The general workflow is explained here in more detail. If you need more control over how the data is split, you could try a manual approach by reusing some logic from the nn.DataParallel implementation.
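
The dim0 split can be pictured on the CPU with torch.chunk. This is only an illustration of the chunking and gathering, assuming 4 devices and the image shape from the question; it does not call the actual scatter logic of nn.DataParallel.

```python
import torch

num_devices = 4  # assumed GPU count, for illustration only
batch = torch.rand(8, 3, 480, 640)

# Split along dim 0: one chunk per device.
chunks = torch.chunk(batch, num_devices, dim=0)
chunk_shapes = [tuple(c.shape) for c in chunks]
print(chunk_shapes)

# Gathering concatenates the per-device results along dim 0 again.
gathered = torch.cat(chunks, dim=0)
print(tuple(gathered.shape))
```

Each chunk keeps its position in the batch, so after gathering the samples are back in their original order.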

Thanks @ptrblck for the prompt reply. I printed the size of features inside the forward function of models["decoder"], but the batch size was not reduced, implying that the data is not being distributed across the GPUs. To be specific, features is a list of length 5, and each of its elements is a 4-dimensional tensor (batch size x channels x H x W). What confuses me is that each element of the list (which is a tensor) has to be shared across multiple GPUs (in the DataParallel case), and I am not sure how to go about this, since features.cuda() is obviously invalid (AttributeError: 'list' object has no attribute 'cuda').

I don’t quite understand what you mean by each element of the list having to be “shared” across multiple GPUs.
nn.DataParallel will split the input tensor in dim0 and send each chunk to a GPU. The elements won’t be duplicated, which is what “sharing” would seem to imply.
Since the list input is apparently not working, you could create a single tensor from this list and make sure that the dimension to split is dim0.
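
One way to follow this suggestion when the feature maps have different shapes, sketched under assumptions: flatten each tensor to (batch, -1), concatenate along dim1 so the batch stays in dim0, and rebuild the list inside the decoder's forward. pack_features, unpack_features and the example shapes are hypothetical and not taken from the original model.

```python
import torch


def pack_features(features):
    """Flatten each (B, C, H, W) tensor and concatenate along dim 1."""
    shapes = [tuple(f.shape[1:]) for f in features]
    packed = torch.cat([f.flatten(start_dim=1) for f in features], dim=1)
    return packed, shapes


def unpack_features(packed, shapes):
    """Rebuild the list of (B, C, H, W) tensors from the packed tensor."""
    sizes = [c * h * w for c, h, w in shapes]
    pieces = torch.split(packed, sizes, dim=1)
    return [p.reshape(p.shape[0], *s) for p, s in zip(pieces, shapes)]


# Made-up multi-scale feature shapes with a batch size of 8.
features = [
    torch.rand(8, 16, 24, 32),
    torch.rand(8, 32, 12, 16),
    torch.rand(8, 64, 6, 8),
]
packed, shapes = pack_features(features)
print(tuple(packed.shape))

# Simulate the chunk one replica would receive (2 of 8 samples), then unpack it.
chunk = packed[:2]
restored = unpack_features(chunk, shapes)
print([tuple(r.shape) for r in restored])
```

Because the per-scale shapes are fixed for a given input resolution, they could be stored once on the decoder module so that only the packed tensor needs to be passed through nn.DataParallel.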