nn.DataParallel with input as a list not a tensor

Hi everyone,
I’m trying to train FactorizableNet on multiple GPUs. My GPUs are Titan Xp.
This network can be trained with a batch size equal to the number of GPUs. As you can see in the code snippet below, the main input of the model is a list of images of different sizes, not a tensor. The dataloader and the collate function just return a list of tuples.
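For reference, a collate function that just hands the batch back as a plain Python list (a minimal sketch; list_collate is an illustrative name, not the repo's actual code) would look like this:

def list_collate(batch):
    # keep the variable-size samples as a plain Python list instead of
    # letting the default collate try to stack them into a single tensor
    return batch

# loader = DataLoader(dataset, batch_size=8, collate_fn=list_collate)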

for i, sample in enumerate(loader):  # (im_data, im_info, gt_objects, gt_relationships)
    # measure the data loading time
    batch_size = len(sample['visual'])
    meters['data_time'].update(time.time() - end, n=batch_size)

    input_visual = [item for item in sample['visual']]
    target_objects = sample['objects']
    target_relations = sample['relations']
    image_info = sample['image_info']
    # RPN targets
    rpn_anchor_targets_obj = [[
            np_to_variable(item[0], is_cuda=False, dtype=torch.LongTensor),
            np_to_variable(item[1], is_cuda=False),
            np_to_variable(item[2], is_cuda=False),
            np_to_variable(item[3], is_cuda=False)
            ] for item in sample['rpn_targets']['object']]

    # compute output
    try:
        raw_losses = model(
                im_data=input_visual,
                im_info=image_info,
                gt_objects=target_objects,
                gt_relationships=target_relations,
                rpn_anchor_targets_obj=rpn_anchor_targets_obj)
.....

The problem is that a batch of size 8 takes about 8x as long to complete as a batch of size 1, so the training time is not reduced at all. Memory usage and volatile GPU utilization are very low on all GPUs.
I have asked the author, but it seems that he does not have time to answer questions.
What can I do now to reduce training time?
Thank you.

Hi Cao,

With low memory utilization on all GPUs and a batch of 1 per GPU, you should try to increase the batch size per GPU. I took a brief look at the underlying code and it looks like it is explicitly hard-coded to 1 per GPU (see this commit). If this can be modified to allow more than 1, you’re likely to see some speedup.

Good luck.

Thanks for your reply, and sorry for my unclear writing.
The main point is that the processing time for batch size 1 (1 GPU) is 0.48s, while for batch size 8 (8 GPUs) it is 3.7s. There is no parallel processing happening at all.

I think feeding a list into the model is the main reason, so I would like to get help from you and other PyTorch experts.

It depends on what the underlying implementation does. If it wraps nn.DataParallel, you should see a speedup; if it just processes the examples serially, you won’t. When you run this, do you see GPU utilization on all the GPUs you expect to be participating (e.g. with nvidia-smi)?
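If it helps, here is a quick sanity check (a sketch with a toy nn.Linear, not FactorizableNet code): wrap a module in nn.DataParallel, run one batch, and look at the peak memory on every visible GPU. Devices that never received work will report roughly zero.

import torch
import torch.nn as nn

# toy module; DataParallel splits dim 0 of the input across the GPUs
model = nn.DataParallel(nn.Linear(1024, 1024)).cuda()
x = torch.randn(8, 1024, device='cuda')

out = model(x)
for i in range(torch.cuda.device_count()):
    # a GPU that participated shows a non-zero peak allocation
    print('GPU %d peak memory: %d bytes' % (i, torch.cuda.max_memory_allocated(i)))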

I used gpustat to view GPU utilization every 2 seconds.
Most of the time, GPU utilization is very low. Occasionally it jumps to a high value for a moment and then drops back to idle.

After carefully inspecting the code, I found that the author didn’t use nn.DataParallel but their own DataParallel subclass.
The code for that DataParallel is below:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import DataParallel as DataParallel_raw
import numpy as np


class DataParallel(DataParallel_raw):
    """
    We do the scatter outside of DataParallel.
    Input: scattered inputs without kwargs.
    """

    def __init__(self, module):
        # Disable all the other parameters
        super(DataParallel, self).__init__(module)

    def forward(self, *inputs, **kwargs):
        assert len(inputs) == 0, "Only support arguments like [variable_name = xxx]"
        new_inputs = [{} for _ in self.device_ids]
        for key in kwargs:
            if key == 'im_data':
                # one image per device
                for i, device in enumerate(self.device_ids):
                    new_inputs[i][key] = kwargs[key][i].to(device)
            elif key.startswith("rpn_anchor_targets"):
                for i, device in enumerate(self.device_ids):
                    new_inputs[i][key] = [item.to(device) for item in kwargs[key][i]]
            else:
                assert isinstance(kwargs[key], list)
                for i in range(len(self.device_ids)):
                    new_inputs[i][key] = [kwargs[key][i], ]
        nones = [[] for _ in self.device_ids]
        replicas = self.replicate(self.module, self.device_ids)
        outputs = self.parallel_apply(replicas, nones, new_inputs)
        return self.gather(outputs, self.output_device)

You could add some timing information to those paths. For example, if all the time is spent in parallel_apply, you know that something inside the model is causing this, rather than this custom DataParallel wrapper. Alternatively, wait for the author to have time to debug this with you.
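As a concrete sketch of that suggestion (gpu_timer is my own helper name, not something from the repo), you could wrap each stage of the forward() above in a synchronized timer; the torch.cuda.synchronize() calls make sure asynchronous CUDA launches don’t hide where the time actually goes.

import time
from contextlib import contextmanager

import torch


@contextmanager
def gpu_timer(name):
    torch.cuda.synchronize()   # finish pending GPU work before starting the clock
    start = time.time()
    yield
    torch.cuda.synchronize()   # wait for the timed GPU work to complete
    print('%s: %.3fs' % (name, time.time() - start))

# Inside the custom forward() above, each stage would be wrapped like:
#     with gpu_timer('replicate'):
#         replicas = self.replicate(self.module, self.device_ids)
#     with gpu_timer('parallel_apply'):
#         outputs = self.parallel_apply(replicas, nones, new_inputs)
#     with gpu_timer('gather'):
#         result = self.gather(outputs, self.output_device)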

Thank you very much.
I have just given up working on that code.

My solution:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, inputs, index):
        # 'inputs' is the full list of NumPy arrays; 'index' is the scattered
        # tensor of sample indices belonging to this replica
        inputs = inputs[int(index[0]) : int(index[-1]) + 1]
        inputs = [torch.from_numpy(i).to(index.device) for i in inputs]
        ...

model = nn.DataParallel(Model()).to(device)
index = torch.tensor(list(range(batch_size))).to(device)
x = [ndarray1, ndarray2, ..., ndarray16]   # list of NumPy arrays of different sizes
out = model(x, index)
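For anyone reading this later, the reason this works (as far as I understand nn.DataParallel’s scatter): tensor arguments are chunked along dim 0 across the replicas, while a list of non-tensor objects such as NumPy arrays is handed to every replica in full. The scattered index tensor therefore tells each replica both which slice of the list is its share and, via index.device, which GPU it is running on. A small sketch that shows the behaviour (Probe is an illustrative name; it assumes at least two GPUs are available):

import numpy as np
import torch
import torch.nn as nn


class Probe(nn.Module):
    def forward(self, inputs, index):
        # 'inputs' arrives unchanged on every replica; 'index' is the chunk
        # of sample indices that DataParallel assigned to this device
        lo, hi = int(index[0]), int(index[-1]) + 1
        print('%s got indices %d..%d of a list of length %d'
              % (index.device, lo, hi - 1, len(inputs)))
        return index.float().sum().view(1)   # something gather() can concatenate


batch_size = 4
device = torch.device('cuda:0')
model = nn.DataParallel(Probe()).to(device)
index = torch.arange(batch_size, device=device)
x = [np.random.randn(3, 5) for _ in range(batch_size)]   # stand-ins for the images
out = model(x, index)   # prints one line per participating GPU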