Training Faster R-CNN on multiple GPUs on a single node

I am getting started with PyTorch and am trying to understand its object detection support using https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

I am having trouble parallelizing the Faster R-CNN model. I get the following error when running train_one_epoch in engine.py:

RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)

What am I doing wrong?

I parallelized the model as follows:

backbone = torchvision.models.resnet50(pretrained=True)  # commented out torch.flatten
backbone.avgpool = torch.nn.Identity()
backbone.fc = torch.nn.Identity()
backbone.out_channels = 2048

model = FasterRCNN(backbone=backbone,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_align,
                   num_classes=81)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
parallel_model = torch.nn.DataParallel(model)  # need to first freeze batch norm
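
For reference, anchor_generator and roi_align above were built roughly like this, following the torchvision tutorial (a sketch; the exact sizes, aspect ratios and featmap_names are placeholders and may differ from my actual script):

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# one tuple of anchor sizes/ratios per feature map; the resnet50 backbone
# here returns a single 2048-channel feature map
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_align = MultiScaleRoIAlign(featmap_names=['0'],  # [0] on older torchvision
                               output_size=7,
                               sampling_ratio=2)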

Debugging with pdb seems to suggest that it is trying to scatter "image_id" in the target dictionary. I thought I was applying torch.nn.DataParallel in the wrong place, so I looked at https://github.com/pytorch/vision/blob/master/references/detection/train.py, but it seems fine.

I also tried torch.nn.parallel.DistributedDataParallel, without success, in the following way:

torch.distributed.init_process_group(backend='nccl', world_size=1, rank=0)
parallel_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[0])

launched with

python3 -m torch.distributed.launch --nproc_per_node=1 --use_env train_faster_rcnn.py
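
For comparison, the per-process setup in references/detection/train.py boils down to roughly the following (a sketch, not my actual script; with --use_env the launcher exposes LOCAL_RANK, RANK and WORLD_SIZE as environment variables, and model / train_dataset are placeholders):

import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model.to(torch.device("cuda", local_rank))
parallel_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# each process sees a different shard of the dataset
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)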


Could you post the shapes of the data you are passing to the model after wrapping it in nn.DataParallel?

On line 27 in train_one_epoch in engine.py, just after

for images, targets in metric_logger.log_every(data_loader, print_freq, header):

(Pdb) p type(images)
<class 'tuple'>
(Pdb) p type(targets)
<class 'tuple'>
(Pdb) p images[0].shape
torch.Size([3, 640, 426])
(Pdb) p images[1].shape
torch.Size([3, 427, 640])
(Pdb) p type(targets[0])
<class 'dict'>
(Pdb) p targets[0].keys()
dict_keys(['boxes', 'labels', 'image_id', 'area', 'iscrowd'])
(Pdb) p targets[0]['boxes']
tensor([[  2.8600,   0.0000, 426.0000, 640.0000],
        [122.5200,   2.1600, 412.2500, 395.6800],
        [113.8600, 346.6700, 225.6200, 386.8300]])

I'm not sure if nn.DataParallel works with tuples (of tensors or dicts).
CC @fmassa, who might know how to properly use data parallel for segmentation models.

Thanks. Not sure how to CC.

In https://github.com/pytorch/vision/blob/master/references/detection/train.py, the collate_fn for torch.utils.data.DataLoader seems to be the one in utils.py, which returns a tuple. The model in that file gets parallelized with torch.nn.parallel.DistributedDataParallel (which I understand is different from torch.nn.DataParallel) and trained with train_one_epoch. I tried torch.nn.parallel.DistributedDataParallel in the way mentioned at the end of my original post, but was unsuccessful.
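
For context, the collate_fn in that utils.py is essentially just the following (a sketch of what I believe the upstream helper does):

def collate_fn(batch):
    # batch is a list of (image, target) pairs; zip(*batch) regroups it into
    # a tuple of images and a tuple of target dicts, without stacking tensors
    return tuple(zip(*batch))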

Ah OK. What kind of error did you get using DDP?

Here is the complete error with torch.nn.DataParallel. I replaced some directory names with "HOME" and "PATH_TO_SCRIPT".

RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3c89d6f273 in HOME/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::chunk(at::Tensor const&, long, long) + 0x2ff (0x7f3c1a3ac0cf in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: at::TypeDefault::chunk(at::Tensor const&, long, long) + 0x9 (0x7f3c1a6d4f99 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) + 0x29a (0x7f3c1bfbfcca in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<c10::cuda::CUDAStream>, std::allocator<c10::optional<c10::cuda::CUDAStream> > > > const&) + 0x3e1 (0x7f3c1ca938d1 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: + 0x5f3d8f (0x7f3c8afb6d8f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1d3ef4 (0x7f3c8ab96ef4 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3.6() [0x50746c]
frame #8: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #9: /usr/bin/python3.6() [0x504e80]
frame #10: /usr/bin/python3.6() [0x506ac3]
frame #11: /usr/bin/python3.6() [0x507330]
frame #12: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #13: /usr/bin/python3.6() [0x504e80]
frame #14: /usr/bin/python3.6() [0x56cbbb]
frame #15: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
frame #16: THPFunction_apply(_object*, _object*) + 0x9df (0x7f3c8adb791f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #17: /usr/bin/python3.6() [0x507217]
frame #18: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #19: /usr/bin/python3.6() [0x5057d7]
frame #20: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #21: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #22: /usr/bin/python3.6() [0x5c377c]
frame #23: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
frame #24: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #25: /usr/bin/python3.6() [0x5057d7]
frame #26: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #27: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #28: /usr/bin/python3.6() [0x5c377c]
frame #29: PySequence_Tuple + 0x222 (0x5a1102 in /usr/bin/python3.6)
frame #30: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #31: /usr/bin/python3.6() [0x5057d7]
frame #32: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #33: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #34: /usr/bin/python3.6() [0x5c377c]
frame #35: PySequence_Tuple + 0x19b (0x5a107b in /usr/bin/python3.6)
frame #36: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #37: /usr/bin/python3.6() [0x5057d7]
frame #38: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #39: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #40: /usr/bin/python3.6() [0x5c377c]
frame #41: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
frame #42: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #43: /usr/bin/python3.6() [0x5057d7]
frame #44: /usr/bin/python3.6() [0x506ac3]
frame #45: /usr/bin/python3.6() [0x507330]
frame #46: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #47: /usr/bin/python3.6() [0x5057d7]
frame #48: /usr/bin/python3.6() [0x506ac3]
frame #49: /usr/bin/python3.6() [0x507330]
frame #50: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #51: /usr/bin/python3.6() [0x504e80]
frame #52: /usr/bin/python3.6() [0x506ac3]
frame #53: /usr/bin/python3.6() [0x507330]
frame #54: _PyEval_EvalFrameDefault + 0x1548 (0x508f48 in /usr/bin/python3.6)
frame #55: /usr/bin/python3.6() [0x5064e4]
frame #56: /usr/bin/python3.6() [0x507330]
frame #57: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #58: /usr/bin/python3.6() [0x504e80]
frame #59: /usr/bin/python3.6() [0x5b9928]
frame #60: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
frame #61: _PyEval_EvalFrameDefault + 0x1ab3 (0x5094b3 in /usr/bin/python3.6)
frame #62: /usr/bin/python3.6() [0x504e80]
frame #63: /usr/bin/python3.6() [0x5b96f0]

PATH_TO_SCRIPT/train_faster_rcnn.py(78)<module>()
-> train_one_epoch(parallel_model, optimizer, train_dataloader, device, curr_epoch, 100)

The debugger ends in the module scatter_gather.py on line 13 in the nested function def scatter_map(obj): return Scatter.apply(target_gpus, None, dim, obj). At this point, obj is tensor(533958, device='cuda:0'). If I step up one level, the code is on line 15 in the same nested function def scatter_map(obj): return list(zip(*map(scatter_map, obj))). Here obj is ('image_id', tensor(533958, device='cuda:0')). "image_id" is a field in the target dictionary, which seems to indicate that the code is trying to scatter the fields of the target dictionary.

My bad. In the target dict, target["image_id"] needs to be torch.tensor([idx]). I was setting it to torch.tensor(idx).
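
To make the difference concrete (a quick check with the image_id value from the traceback):

import torch

torch.tensor(533958).dim()       # 0-dim tensor: this is what Scatter/chunk failed on
# torch.tensor(533958).chunk(2)  # -> RuntimeError: chunk expects at least a 1-dimensional tensor
torch.tensor([533958]).dim()     # 1
torch.tensor([533958]).chunk(2)  # (tensor([533958]),) -- scatters fine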

Good to hear you've found the bug. Is it working now?

Still need to thoroughly test it, but it seems to be working with torch.nn.parallel.DistributedDataParallel. Thanks for your help!