Hi all,
I’m following this tutorial (https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) and have everything working fine on a single GPU with a batch size of 4 and a custom dataset. My setup is PyTorch 1.2 and torchvision 0.4 on a machine with 2 GPUs.
I’m trying to get it to work with either DataParallel or DistributedDataParallel (as per https://github.com/pytorch/vision/blob/master/references/detection/train.py).
I’m getting this error with DataParallel:
Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/open_images_2019/object_detection/baseline.py", line 61, in <module>
    main(None)
  File "/home/anjum/PycharmProjects/kaggle/open_images_2019/object_detection/baseline.py", line 53, in main
    train_one_epoch(model, optimizer, train_loader, device, epoch, print_freq=10)
  File "/home/anjum/PycharmProjects/kaggle/open_images_2019/references/detection/engine.py", line 30, in train_one_epoch
    loss_dict = model(images, targets)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 47, in forward
    images, targets = self.transform(images, targets)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 40, in forward
    image = self.normalize(image)
  File "/home/anjum/anaconda3/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 55, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0
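Poking at this, I have a guess about where the size-2 tensor comes from (this is my assumption, not something I’ve confirmed in the DataParallel source): the model takes a *list* of unbatched 3xHxW images, and scatter seems to split each tensor it finds along dim 0, which for an unbatched image is the channel dimension rather than a batch dimension. A chunk along dim 0 would behave like this:

```python
import torch

# Each image handed to the detection model is an unbatched 3xHxW tensor.
image = torch.rand(3, 600, 800)

# Splitting it across 2 devices along dim 0 (which I assume is what
# scatter does) divides the *channel* dimension, not a batch dimension:
chunks = image.chunk(2, dim=0)
print([tuple(c.shape) for c in chunks])  # [(2, 600, 800), (1, 600, 800)]
```

That would match replica 0 failing with a size-2 tensor against the 3-element `mean` in `normalize`. Does that sound right?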
Using DistributedDataParallel gives me a similar size-mismatch error. Any ideas what could be going wrong?