Is it possible to use the MaskRCNN network on multiple GPUs?
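import torch
import torchvision

device = 'cuda:0'

# Pretrained Mask R-CNN, wrapped in DataParallel to use every visible GPU
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model = torch.nn.DataParallel(model)
model.to(device)
model.eval()

# Detection models take a list of 3xHxW tensors; the images may differ in size
x = [torch.rand(3, 300, 400).to(device), torch.rand(3, 500, 400).to(device)]
preds = model(x)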
This code gives the following error:
Traceback (most recent call last):
File "test_parallel.py", line 15, in <module>
preds = model(imgs)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py", line 47, in forward
images, targets = self.transform(images, targets)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 40, in forward
image = self.normalize(image)
File "/opt/anaconda3/envs/alp/lib/python3.7/site-packages/torchvision/models/detection/transform.py", line 55, in normalize
return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0
If I comment out the line model = torch.nn.DataParallel(model), it works.
There is a GitHub issue on this subject, and the problem seems to come from how the DataParallel module works combined with how the MaskRCNN methods are defined (if I have understood correctly).
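One way to read the traceback, offered as a sketch rather than a confirmed diagnosis: DataParallel splits every input tensor along dimension 0, and for a list of images each tensor in the list appears to be split separately. Dimension 0 of a 3xHxW image is the channel axis, so with two GPUs the first replica would receive a 2-channel image, which matches the "tensor a (2) must match the size of tensor b (3)" message. The helper `chunk_sizes` below is hypothetical; it only mimics the ceiling-division rule of `torch.chunk` in plain Python, and the two-GPU count is an assumption.

```python
def chunk_sizes(length, num_chunks):
    """Sizes obtained when a dimension of `length` is cut into at most
    `num_chunks` pieces of ceil(length / num_chunks) elements each."""
    step = -(-length // num_chunks)  # ceiling division
    return [min(step, length - start) for start in range(0, length, step)]


# Each image is 3 x H x W, so dimension 0 is the channel axis.
channels = 3
num_gpus = 2  # assumption: two visible GPUs

# Number of channels each replica would see if the image itself is split.
per_replica = chunk_sizes(channels, num_gpus)
print(per_replica)  # [2, 1]
```

If this reading is right, the mean and std used for normalization still have 3 entries while the image on replica 0 has only 2 channels, which is where the subtraction fails.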
Is there a solution that does not require major changes?
torch 1.2.0
torchvision 0.4.0