DataParallel and cuda with multiple inputs

Hi,

I am interested in using the cuda() primitive and also DataParallel. I currently have my network implemented; it takes multiple inputs. Here is the train function:

    for i, (input, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        target = target.cuda(async=True)
        input_25 = torch.autograd.Variable(input[0])
        input_51 = torch.autograd.Variable(input[1])
        input_75 = torch.autograd.Variable(input[2])

        target_var = torch.autograd.Variable(target)

        # compute output
        output = model(patch25=input_25, patch51=input_51, patch75=input_75)

This actually works if I keep everything on the CPU. My first implementation used DataParallel, but its forward function only takes a single input, not a list or dict, just like the default implementation of nn.Module.forward(); maybe this is an intended choice.

So I tried to just move the network to the GPU:

    # basic_conv() returns a nn.Module
    net = basic_conv().cuda()

Then I get this error, which I cannot interpret myself:

Traceback (most recent call last):
  File "read_network.py", line 29, in <module>
    net(patch25=in25, patch51=in51, patch75=in75)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 210, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/hdd/code/scripts/simple_conv.py", line 30, in forward
    x_25 = self.conv2d_25_5(x['patch25'])
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 210, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 235, in forward
    self.padding, self.dilation, self.groups)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/functional.py", line 37, in conv2d
    return f(input, weight, bias) if bias is not None else f(input, weight)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/_functions/conv.py", line 33, in forward
    output = self._update_output(input, weight, bias)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/_functions/conv.py", line 88, in _update_output
    return self._thnn('update_output', input, weight, bias)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/_functions/conv.py", line 147, in _thnn
    return impl[fn_name](self, self._bufs[0], input, weight, *args)
  File "/home/ganaye/deps/miniconda3/lib/python3.5/site-packages/torch/nn/_functions/conv.py", line 225, in call_update_output
    bias, *args)
TypeError: FloatSpatialConvolutionMM_updateOutput received an invalid combination of arguments - got (int, torch.FloatTensor, torch.FloatTensor, torch.cuda.FloatTensor, torch.cuda.FloatTensor, torch.FloatTensor, torch.FloatTensor, int, int, int, int, int, int), but expected (int state, torch.FloatTensor input, torch.FloatTensor output, torch.FloatTensor weight, [torch.FloatTensor bias or None], torch.FloatTensor finput, torch.FloatTensor fgradInput, int kW, int kH, int dW, int dH, int padW, int padH)

It seems the error is coming from here; this is the forward call of my network:

    def forward(self, **x):
        # patch of size 25
        x_25 = self.conv2d_25_5(x['patch25'])
        x_25 = F.max_pool2d(x_25, 2, stride=1, padding=0)

I don’t get why it would work in CPU mode but not on the GPU; the only thing I changed is calling cuda() on the network.

Help!!

Thanks 🙂

Ok…

The input Variables were not transferred to the GPU; I had assumed the DataLoader was doing this implicitly. So the cuda problem is solved, thanks!
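For reference, a minimal sketch of the fixed part of the train loop, assuming the same loop as above; the only change is the .cuda() calls on the input tensors:

    target = target.cuda()
    input_25 = torch.autograd.Variable(input[0].cuda())
    input_51 = torch.autograd.Variable(input[1].cuda())
    input_75 = torch.autograd.Variable(input[2].cuda())

    target_var = torch.autograd.Variable(target)

    # now every tensor the model sees lives on the GPU,
    # matching the model's weights after .cuda()
    output = model(patch25=input_25, patch51=input_51, patch75=input_75)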

Any ideas are welcome on using DataParallel with multiple inputs!

Modules can take as many parameters as you want; they’re not restricted to a single one. DataLoader never transfers the data to the GPU for you, so you have to do it manually. What’s the problem with the DataLoader with multiple inputs? Your dataset can return more than 2 values per index.
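To illustrate (a hedged sketch with made-up names, not code from the thread): a module whose forward takes several arguments, and a dataset whose __getitem__ returns several values per index:

    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset

    class MultiInputNet(nn.Module):            # hypothetical module
        def __init__(self):
            super(MultiInputNet, self).__init__()
            self.conv25 = nn.Conv2d(1, 8, 5)
            self.conv51 = nn.Conv2d(1, 8, 5)

        def forward(self, patch25, patch51):   # as many parameters as you want
            return self.conv25(patch25), self.conv51(patch51)

    class MultiPatchDataset(Dataset):          # hypothetical dataset
        def __getitem__(self, index):
            # a dataset can return any number of values per index
            return torch.randn(1, 25, 25), torch.randn(1, 51, 51), 0

        def __len__(self):
            return 10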


Sorry, I meant DataParallel instead of DataLoader. I would like to give multiple inputs to DataParallel. I will probably need to modify my network so that each input is distributed through its own DataParallel.

Can you explain this code, extracted from the imagenet example:

    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.cuda()
    else:
        model = torch.nn.DataParallel(model).cuda()

Is there a specific reason to separate the classifier and the features in the alexnet and vgg models?

Why not give the whole model to DataParallel, like in the resnet model?

Thanks

I think the reason is that data parallelism is more efficient for the convolutional layers than for the fully connected classifier. This is explained in https://arxiv.org/abs/1404.5997
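As a hedged illustration of that split (assuming an AlexNet-style model with .features for the convolutions and .classifier for the fully connected layers):

    import torch
    import torchvision.models as models

    model = models.alexnet()
    # only the convolutional part is replicated across GPUs
    model.features = torch.nn.DataParallel(model.features)
    model.cuda()

    # during forward: the convolutions (few parameters, lots of computation)
    # run data-parallel on every GPU, then DataParallel gathers the feature
    # maps onto the default GPU, where the large fully connected classifier
    # runs once.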


@trypag we should support having multiple inputs to DataParallel. I’ve opened an issue for this.

cool, thank you 🙂

I implemented data_parallel with two inputs, but it does not work:

    def data_parallel2(module, input1, input2, device_ids, output_device=None):
        """Evaluates module(input1, input2) in parallel across the GPUs given in device_ids.

        This is the functional version of the DataParallel module.

        Args:
            module: the module to evaluate in parallel
            input1, input2: inputs to the module
            device_ids: GPU ids on which to replicate module
            output_device: GPU location of the output. Use -1 to indicate the CPU.
                (default: device_ids[0])
        Returns:
            a Variable containing the result of module(input1, input2) located on
            output_device
        """
        if not device_ids:
            return module(input1, input2)

        if output_device is None:
            output_device = device_ids[0]

        # replicate the module onto each device and scatter both inputs
        replicas = replicate(module, device_ids)
        input1s = scatter(input1, device_ids)
        input2s = scatter(input2, device_ids)
        replicas = replicas[:len(input1s)]
        outputs = parallel_apply2(replicas, input1s, input2s)
        return gather(outputs, output_device)

    def parallel_apply2(modules, input1s, input2s):
        assert len(modules) == len(input1s)
        # Fast track: a single replica needs no threading
        if len(modules) == 1:
            return (modules[0](input1s[0], input2s[0]),)

        lock = threading.Lock()
        results = {}

        def _worker(module, input1, input2, results, lock):
            # unwrap nested containers until we reach a Variable,
            # so the right CUDA device can be selected
            var_input1 = input1
            var_input2 = input2
            while not isinstance(var_input1, Variable):
                var_input1 = var_input1[0]
            while not isinstance(var_input2, Variable):
                var_input2 = var_input2[0]
            try:
                with torch.cuda.device_of(var_input1):
                    output = module(input1, input2)
                with lock:
                    results[input1] = output
            except Exception as e:
                with lock:
                    results[input1] = e

        # one thread per replica, each fed its own scattered inputs
        threads = [threading.Thread(target=_worker,
                                    args=(module, input1, input2, results, lock))
                   for module, input1, input2 in zip(modules, input1s, input2s)]

It’s a bug in your implementation. See discussion in this issue.
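For completeness, the snippet above stops right after building the thread list. A hedged sketch of the steps that would typically follow inside parallel_apply2 (start and join the threads, then collect the outputs in the order of input1s, since results is keyed by input1) could look like this; it is only an illustration, not necessarily the bug discussed in the issue:

    # hypothetical continuation of parallel_apply2 above
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # collect results in the order of the scattered first inputs;
    # re-raise any exception captured inside a worker thread
    outputs = []
    for input1 in input1s:
        output = results[input1]
        if isinstance(output, Exception):
            raise output
        outputs.append(output)
    return outputs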