DataParallel with multiple outputs

My model has multiple outputs, and it works well on one GPU.
But when I use DataParallel, it reports an error:

/usr/local/lib/python2.7/dist-packages/torch/_utils.py:112: UserWarning: src is not broadcastable to dst, but they have the same number of elements.  Falling back to deprecated pointwise behavior.
  flat.narrow(0, offset, numel).copy_(tensor)
Traceback (most recent call last):
  File "baseline.py", line 119, in <module>
    dset_loaders=dset_loaders, dset_classes=dset_classes, val_aug=val_aug, dis_margin=dis_margin)
  File "/home/cv/ensemble/trainer_new.py", line 99, in train_model
    outputs_0, outputs_1, outputs_2, fc3_distance, fc2_distance, fc1_distance, conv5_3_distance, conv4_3_distance, conv3_3_distance, conv2_3_distance, conv1_3_distance, fc3_weight_distance, fc2_weight_distance, fc1_weight_distance, conv5_3_weight_distance, conv4_3_weight_distance, conv3_3_weight_distance, conv2_3_weight_distance, conv1_3_weight_distance = model(inputs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 61, in forward
    return self.gather(outputs, self.output_device)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 73, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 50, in gather
    return gather_map(outputs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 49, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 49, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 49, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

Could anyone help me with it?
Thanks!

Any chance you can share your model definition to help figure out the problem?

I encountered the same error and it seems that PyTorch DataParallel doesn’t support models with multiple inputs, right?

My model prototype can be summarized as below:

class Sphereface20(nn.Module):
  def __init__(self, dim=512, num_class=10572):
    super(Sphereface20, self).__init__()
    self.num_class = num_class
    self.base = Resnet20()
    self.fc6 = AngleLinear(dim, num_class, gamma=0.06)
  def forward(self, x, target=None):
    x = self.base(x)
    if self.training:
      x, lamb = self.fc6(x, target)
      return x, lamb
    else:
      return x

The AngleLinear operator accepts both the feature and the target (label) as input, and outputs x as well as an intermediate variable lamb that is used to monitor the training procedure.

Everything works well with single-GPU training.
But it gives me the error below when I try to use nn.DataParallel for dual-GPU training:

Traceback (most recent call last):
  File "train_sample_exclusive.py", line 228, in <module>
    main()
  File "train_sample_exclusive.py", line 213, in main
    train_epoch(train_loader, model, optimizer, epoch)
  File "train_sample_exclusive.py", line 127, in train_epoch
    output, fc5, lamb = model(data, label)
  File "/home/guest/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/guest/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 115, in forward
    return self.gather(outputs, self.output_device)
  File "/home/guest/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 127, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/guest/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    return gather_map(outputs)
  File "/home/guest/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/guest/.local/lib/python2.7/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration

I can provide more model details if you wish, @smth.

DataParallel supports multiple inputs. The error has to be somewhere else.
Could you explain what AngleLinear is?
Here is a small example using DataParallel and multiple inputs:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(10, 2)
        self.fc2 = nn.Linear(15, 2)
        
    def forward(self, x1, x2):
        print(x1.device, x2.device)
        x1 = self.fc1(x1)
        x2 = self.fc2(x2)
        x = torch.cat((x1, x2), dim=1)
        return x

device = 'cuda:0'
model = MyModel().to(device)
x1 = torch.randn(2, 10).to(device)
x2 = torch.randn(2, 15).to(device)

output = model(x1, x2)
> (device(type='cuda', index=0), device(type='cuda', index=0))

net = nn.DataParallel(model, device_ids=[0, 1])
output = net(x1, x2)
> (device(type='cuda', index=0), device(type='cuda', index=0))
> (device(type='cuda', index=1), device(type='cuda', index=1))
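
A follow-up sketch for the multiple-outputs case (TwoOutputModel and the value 5.0 are made up for illustration, and this is only a guess at what might be happening in your model): multiple outputs also work as long as every returned value is a tensor. Returning a plain Python number reproduces exactly the error from this thread, because gather_map falls through to the zip(*outputs) branch for non-tensor values:

import torch
import torch.nn as nn

class TwoOutputModel(nn.Module):
    def __init__(self):
        super(TwoOutputModel, self).__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        out = self.fc(x)
        lamb = 5.0        # plain Python float, not a tensor
        return out, lamb  # gather recurses into the tuple element by element

device = 'cuda:0'
net = nn.DataParallel(TwoOutputModel().to(device), device_ids=[0, 1])
x = torch.randn(4, 10).to(device)

# out, lamb = net(x)
# -> TypeError: zip argument #1 must support iteration
# The tensor output is gathered fine, but gather_map ends up calling
# zip(5.0, 5.0) for the float. Returning e.g. out.new_tensor(lamb) instead
# avoids the error (recent versions then warn and return one value per GPU).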

@ptrblck AngleLinear is described in this paper, which introduces an angular decision margin.

The error is probably caused somewhere else, and it's hard to describe AngleLinear in detail here.
I'll check my code thoroughly and keep this thread updated.

Hi zeakey,

Same problem. I am using DataParallel on https://github.com/IBM/pytorch-seq2seq/ and face the same situation. Any suggestions?

TypeError                                 Traceback (most recent call last)
<ipython-input-4-05e939d5ea69> in <module>()
     17         batch_label = batch_label.cuda()
     18 
---> 19         decoder_outputs, decoder_hidden, other = seq2seq_model(batch_data, input_lengths = None, target_variable = batch_label,teacher_forcing_ratio=0.8)
     20 
     21         learning_rate = 1e-4

~/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    113         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
    114         outputs = self.parallel_apply(replicas, inputs, kwargs)
--> 115         return self.gather(outputs, self.output_device)
    116 
    117     def replicate(self, module, device_ids):

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in gather(self, outputs, output_device)
    125 
    126     def gather(self, outputs, output_device):
--> 127         return gather(outputs, output_device, dim=self.dim)
    128 
    129 

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py in gather(outputs, target_device, dim)
     66     # Setting the function to None clears the refcycle.
     67     try:
---> 68         return gather_map(outputs)
     69     finally:
     70         gather_map = None

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py in gather_map(outputs)
     61             return type(out)(((k, gather_map([d[k] for d in outputs]))
     62                               for k in out))
---> 63         return type(out)(map(gather_map, zip(*outputs)))
     64 
     65     # Recursive function calls like this create reference cycles.

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py in gather_map(outputs)
     60                 raise ValueError('All dicts must have the same number of keys')
     61             return type(out)(((k, gather_map([d[k] for d in outputs]))
---> 62                               for k in out))
     63         return type(out)(map(gather_map, zip(*outputs)))
     64 

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py in <genexpr>(.0)
     60                 raise ValueError('All dicts must have the same number of keys')
     61             return type(out)(((k, gather_map([d[k] for d in outputs]))
---> 62                               for k in out))
     63         return type(out)(map(gather_map, zip(*outputs)))
     64 

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py in gather_map(outputs)
     61             return type(out)(((k, gather_map([d[k] for d in outputs]))
     62                               for k in out))
---> 63         return type(out)(map(gather_map, zip(*outputs)))
     64 
     65     # Recursive function calls like this create reference cycles.

~/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py in gather_map(outputs)
     61             return type(out)(((k, gather_map([d[k] for d in outputs]))
     62                               for k in out))
---> 63         return type(out)(map(gather_map, zip(*outputs)))
     64 
     65     # Recursive function calls like this create reference cycles.

TypeError: zip argument #1 must support iteration

Thanks
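
Judging from the traceback, gather recurses into a dict output (presumably the other dict returned by the seq2seq decoder) and then fails on a value that is not a tensor, possibly something like a list of plain Python ints (e.g. sequence lengths). A small sketch with a made-up helper name, which walks a nested output and prints the path of every non-tensor leaf, could help locate the offending entry:

import torch

def find_non_tensor_leaves(obj, path="output"):
    # recursively print the location of every leaf that is not a tensor,
    # since DataParallel's gather expects tensors (possibly nested inside
    # lists/tuples/dicts of tensors)
    if torch.is_tensor(obj):
        return
    if isinstance(obj, dict):
        for k, v in obj.items():
            find_non_tensor_leaves(v, "%s[%r]" % (path, k))
    elif isinstance(obj, (list, tuple)):
        for i, v in enumerate(obj):
            find_non_tensor_leaves(v, "%s[%d]" % (path, i))
    else:
        print("non-tensor leaf at %s: %s" % (path, type(obj)))

# run the unwrapped model (model.module, or a single-GPU run) once and inspect:
# find_non_tensor_leaves(model(batch_data, input_lengths=None,
#                              target_variable=batch_label,
#                              teacher_forcing_ratio=0.8))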

Same problem here. I have two modules; when I use them with DataParallel, one forwards correctly, but the second one gives this error.

Did you solve the problem? I encountered it too.

I had a similar issue. It turned out that all the outputs must be tensors.
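
Applied to the Sphereface20 example above, one possible fix, assuming lamb comes back from fc6 as a plain Python number (just a sketch, not tested against that model), would be to wrap it before returning:

  def forward(self, x, target=None):
    x = self.base(x)
    if self.training:
      x, lamb = self.fc6(x, target)
      # wrap the monitoring value as a tensor so DataParallel's gather
      # can combine it; depending on the PyTorch version the gathered
      # result may come back as a vector with one entry per GPU
      return x, x.new_tensor(lamb)
    else:
      return x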

I had a similar issue, but on a multi-GPU system. I resolved it by setting CUDA_VISIBLE_DEVICES=0, i.e. using a single GPU. It seems DataParallel was not able to gather the outputs from multiple GPUs, but that's fine for me since I only wanted to run on a single GPU anyway.
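
If you prefer setting it from inside the script rather than on the command line, a minimal sketch looks like this; with only one visible device, DataParallel calls the wrapped module directly and the gather step is skipped:

import os
# restrict the process to a single GPU; this must happen before CUDA is
# initialized, so the safest place is before importing torch
# (equivalent to running: CUDA_VISIBLE_DEVICES=0 python train.py)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # -> 1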