Size mismatch when running FasterRCNN in parallel

Hi all,

I’m following this tutorial (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) and have everything working on a single GPU with a batch size of 4 and a custom dataset. I’m using PyTorch 1.2 and torchvision 0.4, and the machine has 2 GPUs.

I’m trying to get it to work either with DataParallel or DistributedDataParallel (as per https://github.com/pytorch/vision/blob/master/references/detection/train.py).

I’m getting this error with DataParallel:

Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/open_images_2019/object_detection/baseline.py", line 61, in <module>
    main(None)
  File "/home/anjum/PycharmProjects/kaggle/open_images_2019/object_detection/baseline.py", line 53, in main
    train_one_epoch(model, optimizer, train_loader, device, epoch, print_freq=10)
  File "/home/anjum/PycharmProjects/kaggle/open_images_2019/references/detection/engine.py", line 30, in train_one_epoch
    loss_dict = model(images, targets)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 47, in forward
    images, targets = self.transform(images, targets)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 40, in forward
    image = self.normalize(image)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 55, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0

Using DistributedDataParallel is giving me a similar error due to a size mismatch. Any ideas what could be going wrong?

After chasing tensors around in the debugger, I found that by the time the code reaches replicate in DataParallel, the inputs have only 2 channels instead of 3. :thinking:
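
A size of 2 where 3 is expected is consistent with how nn.DataParallel scatters its inputs: it splits every tensor it finds along dimension 0. The detection models take a list of [C, H, W] images, so dimension 0 is the channel axis rather than the batch axis. The sketch below is a pure-Python model of that split, not the actual DataParallel code; the helper name chunk_sizes is hypothetical, and it assumes the ceil-division rule that torch.chunk uses.

```python
import math


def chunk_sizes(dim_size, num_chunks):
    """Sizes of the pieces when a dimension is split torch.chunk-style.

    Every piece has ceil(dim_size / num_chunks) elements, except that the
    last piece may be smaller.
    """
    step = math.ceil(dim_size / num_chunks)
    sizes = []
    remaining = dim_size
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# One image from the batch printed in this thread: 3 channels, 768 x 1024.
image_shape = (3, 768, 1024)
num_gpus = 2

# Splitting along dim 0 cuts the channel axis, not the batch.
per_replica = [(c,) + image_shape[1:] for c in chunk_sizes(image_shape[0], num_gpus)]
print(per_replica)  # [(2, 768, 1024), (1, 768, 1024)]
```

Under this model, replica 0 receives a 2-channel image, which would then fail against the 3-element mean and std in normalize.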

It seems as if the data was chunked in a weird way.
Could you print the shape of input and target before passing them to the model?

Thanks for the reply! I put the following lines at line 29 of engine.py:

print([i.size() for i in images])
print([{k: v.size() for k, v in t.items()} for t in targets])
print(targets[0])

This is the output:

[torch.Size([3, 768, 1024]), torch.Size([3, 1024, 682]), torch.Size([3, 731, 1024]), torch.Size([3, 678, 1024])]

[{'boxes': torch.Size([8, 4]), 'labels': torch.Size([8]), 'image_id': torch.Size([1]), 'area': torch.Size([8]), 'iscrowd': torch.Size([8])}, 
{'boxes': torch.Size([5, 4]), 'labels': torch.Size([5]), 'image_id': torch.Size([1]), 'area': torch.Size([5]), 'iscrowd': torch.Size([5])}, 
{'boxes': torch.Size([9, 4]), 'labels': torch.Size([9]), 'image_id': torch.Size([1]), 'area': torch.Size([9]), 'iscrowd': torch.Size([9])}, 
{'boxes': torch.Size([13, 4]), 'labels': torch.Size([13]), 'image_id': torch.Size([1]), 'area': torch.Size([13]), 'iscrowd': torch.Size([13])}]

{'boxes': tensor([[0.4550, 0.3592, 0.6219, 0.9908],
        [0.0000, 0.6842, 0.0994, 0.7950],
        [0.0000, 0.8717, 0.1513, 0.9992],
        [0.0000, 0.5833, 0.0512, 0.6867],
        [0.0838, 0.4783, 0.4000, 0.9992],
        [0.7450, 0.4375, 0.7862, 0.5617],
        [0.4737, 0.9617, 0.5519, 0.9992],
        [0.5569, 0.9575, 0.6394, 0.9992]], device='cuda:0'), 'labels': tensor([113, 134, 134, 113, 113, 113,  70,  70], device='cuda:0'), 'image_id': tensor([1198262], device='cuda:0'), 'area': tensor([0.1054, 0.0110, 0.0193, 0.0053, 0.1647, 0.0051, 0.0029, 0.0034],
       device='cuda:0'), 'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0', dtype=torch.uint8)}

This is how I have my DataLoaders set up:

train_dataset = ObjectDetectionDataset("train", image_transforms)
valid_dataset = ObjectDetectionDataset("valid", image_transforms)

train_loader = DataLoader(train_dataset, batch_size=4, num_workers=NUM_WORKERS, shuffle=True, collate_fn=utils.collate_fn)
valid_loader = DataLoader(valid_dataset, batch_size=4, num_workers=NUM_WORKERS, collate_fn=utils.collate_fn)

Edit: I just realised my bounding boxes are fractional rather than in pixels (I’ll fix that later).
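
The torchvision detection models expect boxes as [x1, y1, x2, y2] in absolute pixel coordinates, so fractional boxes need to be scaled by the image width and height. A minimal sketch, assuming the fractional boxes are already ordered [xmin, ymin, xmax, ymax]; the helper name to_pixel_boxes is hypothetical, and plain lists stand in for tensors:

```python
def to_pixel_boxes(boxes, width, height):
    """Scale fractional [xmin, ymin, xmax, ymax] boxes to pixel coordinates."""
    return [
        [x1 * width, y1 * height, x2 * width, y2 * height]
        for x1, y1, x2, y2 in boxes
    ]


# First box from the printed target; the first image is 768 high by 1024 wide.
fractional = [[0.4550, 0.3592, 0.6219, 0.9908]]
pixel_boxes = to_pixel_boxes(fractional, width=1024, height=768)
print(pixel_boxes)
```

The area entry in each target would need recomputing from the pixel boxes as well.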

Edit 2: I tested this with the PennFudanDataset Colab example (run on my local runtime), changing only one thing: I added model = torch.nn.parallel.DataParallel(model) after model.to(device). I’m still getting the same error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-c750cf7ed566> in <module>
      3 for epoch in range(num_epochs):
      4     # train for one epoch, printing every 10 iterations
----> 5     train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
      6     # update the learning rate
      7     lr_scheduler.step()

~/engine.py in train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
     28         targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
     29 
---> 30         loss_dict = model(images, targets)
     31 
     32         losses = sum(loss for loss in loss_dict.values())

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    150             return self.module(*inputs[0], **kwargs[0])
    151         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152         outputs = self.parallel_apply(replicas, inputs, kwargs)
    153         return self.gather(outputs, self.output_device)
    154 

~/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    160 
    161     def parallel_apply(self, replicas, inputs, kwargs):
--> 162         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    163 
    164     def gather(self, outputs, output_device):

~/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     83         output = results[i]
     84         if isinstance(output, ExceptionWrapper):
---> 85             output.reraise()
     86         outputs.append(output)
     87     return outputs

~/anaconda3/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
    367             # (https://bugs.python.org/issue2651), so we work around it.
    368             msg = KeyErrorMessage(msg)
--> 369         raise self.exc_type(msg)

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 47, in forward
    images, targets = self.transform(images, targets)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 40, in forward
    image = self.normalize(image)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 55, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0

Hi,

I am having the same issue. Did you manage to solve the problem? Thanks!

Detection models do not seem to support nn.DataParallel, so you would need to use nn.DistributedDataParallel instead (which we recommend using anyway).
Have a look at this post for more information.

CC @Anjum_Sayed

Hi, yes I did get it to work in the end.

As @ptrblck mentioned, nn.DataParallel doesn’t work with these object detection models the way you might be used to when using multiple GPUs, because the replica models are not independent. That is why nn.DistributedDataParallel is needed.

Also, for this to work, the script has to be launched in a very specific way so that everything works properly in a distributed manner:
python -m torch.distributed.launch --nproc_per_node=2 --use_env name_of_your_training_script.py

I found this here. There is also a GitHub discussion around some other questions I had. This ensures that each process is spawned correctly and they are able to talk to each other (this is very similar to how TensorFlow distributed works too).

If you are customising the example code (for example, using a different backbone), it’s worth reading through the boilerplate code in the example to understand how certain variables are set so that torch.distributed.launch works properly.
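
For context on those variables: with --use_env, torch.distributed.launch hands each process its rank through environment variables (RANK, WORLD_SIZE, LOCAL_RANK) rather than a --local_rank argument. The sketch below shows roughly how a script can read them. It is loosely modelled on what the reference detection utilities do, not a copy of them; the function name read_launch_env is hypothetical, and the torch calls are only indicated in comments.

```python
import os


def read_launch_env(environ=None):
    """Return rank information set by the launcher, or None if it is absent."""
    environ = os.environ if environ is None else environ
    if "RANK" not in environ or "WORLD_SIZE" not in environ:
        return None  # not started through the distributed launcher
    return {
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
        # Assumption: fall back to 0 if LOCAL_RANK is missing.
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
    }


cfg = read_launch_env()
if cfg is None:
    print("No launcher variables found; running as a single process")
else:
    # A training script would typically continue along these lines:
    #   torch.cuda.set_device(cfg["local_rank"])
    #   torch.distributed.init_process_group(backend="nccl", init_method="env://")
    #   model = torch.nn.parallel.DistributedDataParallel(
    #       model, device_ids=[cfg["local_rank"]])
    print(cfg)
```

Launched with --nproc_per_node=2, each of the two processes would see its own RANK and LOCAL_RANK and a WORLD_SIZE of 2.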

Additionally, if you want to use mixed precision training, you might find this useful.

Thank you a lot for your help! :smiley: I got it to work finally

Thank you for providing the valuable information! :smiley: I got it to work finally