Training Faster R-CNN on multiple GPUs on a single node

I am getting started with PyTorch and am trying to understand its object detection support using https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

I am having trouble parallelizing the Faster R-CNN model. I get the following error when running train_one_epoch in engine.py:

RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)

What am I doing wrong?

I parallelized the model as follows:

backbone = torchvision.models.resnet50(pretrained=True)  # commented out torch.flatten
backbone.avgpool = torch.nn.Identity()
backbone.fc = torch.nn.Identity()
backbone.out_channels = 2048

model = FasterRCNN(backbone=backbone,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_align,
                   num_classes=81)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
parallel_model = torch.nn.DataParallel(model)  # need to first freeze batch norm
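
For reference, anchor_generator and roi_align above were built roughly like this, following the torchvision tutorial (a sketch; the exact sizes, aspect ratios and featmap_names are placeholders and may differ from my actual script):

from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# one tuple of anchor sizes/ratios per feature map; the resnet50 backbone
# here returns a single 2048-channel feature map
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_align = MultiScaleRoIAlign(featmap_names=['0'],  # [0] on older torchvision
                               output_size=7,
                               sampling_ratio=2)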

Debugging with pdb seems to suggest that it is trying to scatter "image_id" in the target dictionary. I thought I was applying torch.nn.DataParallel in the wrong place, so I looked at https://github.com/pytorch/vision/blob/master/references/detection/train.py, but it seems fine.

I also tried torch.nn.parallel.DistributedDataParallel, without success, in the following way:

torch.distributed.init_process_group(backend='nccl', world_size=1, rank=0)
parallel_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[0])

launched with

python3 -m torch.distributed.launch --nproc_per_node=1 --use_env train_faster_rcnn.py
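
For comparison, the per-process setup in references/detection/train.py boils down to roughly the following (a sketch, not my actual script; with --use_env the launcher exposes LOCAL_RANK, RANK and WORLD_SIZE as environment variables, and model / train_dataset are placeholders):

import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model.to(torch.device("cuda", local_rank))
parallel_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# each process sees a different shard of the dataset
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)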


Could you post the shapes of the data you are passing to the model after wrapping it in nn.DataParallel?

On line 27 in train_one_epoch in engine.py, just after

for images, targets in metric_logger.log_every(data_loader, print_freq, header):

(Pdb) p type(images)
<class 'tuple'>
(Pdb) p type(targets)
<class 'tuple'>
(Pdb) p images[0].shape
torch.Size([3, 640, 426])
(Pdb) p images[1].shape
torch.Size([3, 427, 640])
(Pdb) p type(targets[0])
<class 'dict'>
(Pdb) p targets[0].keys()
dict_keys(['boxes', 'labels', 'image_id', 'area', 'iscrowd'])
(Pdb) p targets[0]['boxes']
tensor([[  2.8600,   0.0000, 426.0000, 640.0000],
        [122.5200,   2.1600, 412.2500, 395.6800],
        [113.8600, 346.6700, 225.6200, 386.8300]])

I'm not sure if nn.DataParallel works with tuples (of tensors or dicts).
CC @fmassa, who might know how to properly use data parallel for segmentation models.

Thanks. Not sure how to CC.

In https://github.com/pytorch/vision/blob/master/references/detection/train.py, the collate_fn for torch.utils.data.DataLoader seems to be the one in utils.py, which returns a tuple. The model in that file gets parallelized with torch.nn.parallel.DistributedDataParallel (which I understand is different from torch.nn.DataParallel) and trained with train_one_epoch. I tried torch.nn.parallel.DistributedDataParallel in the way mentioned at the end of my original post, but was unsuccessful.
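
For context, the collate_fn in that utils.py is essentially just the following (a sketch of what I believe the upstream helper does):

def collate_fn(batch):
    # batch is a list of (image, target) pairs; zip(*batch) regroups it into
    # a tuple of images and a tuple of target dicts, without stacking tensors
    return tuple(zip(*batch))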

Ah OK. What kind of error did you get using DDP?

Here is the complete error with torch.nn.DataParallel. I replaced some directory names with "HOME" and "PATH_TO_SCRIPT".

RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3c89d6f273 in HOME/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::chunk(at::Tensor const&, long, long) + 0x2ff (0x7f3c1a3ac0cf in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: at::TypeDefault::chunk(at::Tensor const&, long, long) + 0x9 (0x7f3c1a6d4f99 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) + 0x29a (0x7f3c1bfbfcca in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<c10::cuda::CUDAStream>, std::allocator<c10::optional<c10::cuda::CUDAStream> > > > const&) + 0x3e1 (0x7f3c1ca938d1 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: + 0x5f3d8f (0x7f3c8afb6d8f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1d3ef4 (0x7f3c8ab96ef4 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3.6() [0x50746c]
frame #8: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #9: /usr/bin/python3.6() [0x504e80]
frame #10: /usr/bin/python3.6() [0x506ac3]
frame #11: /usr/bin/python3.6() [0x507330]
frame #12: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #13: /usr/bin/python3.6() [0x504e80]
frame #14: /usr/bin/python3.6() [0x56cbbb]
frame #15: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
frame #16: THPFunction_apply(_object*, _object*) + 0x9df (0x7f3c8adb791f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #17: /usr/bin/python3.6() [0x507217]
frame #18: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #19: /usr/bin/python3.6() [0x5057d7]
frame #20: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #21: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #22: /usr/bin/python3.6() [0x5c377c]
frame #23: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
frame #24: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #25: /usr/bin/python3.6() [0x5057d7]
frame #26: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #27: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #28: /usr/bin/python3.6() [0x5c377c]
frame #29: PySequence_Tuple + 0x222 (0x5a1102 in /usr/bin/python3.6)
frame #30: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #31: /usr/bin/python3.6() [0x5057d7]
frame #32: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #33: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #34: /usr/bin/python3.6() [0x5c377c]
frame #35: PySequence_Tuple + 0x19b (0x5a107b in /usr/bin/python3.6)
frame #36: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #37: /usr/bin/python3.6() [0x5057d7]
frame #38: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #39: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #40: /usr/bin/python3.6() [0x5c377c]
frame #41: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
frame #42: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #43: /usr/bin/python3.6() [0x5057d7]
frame #44: /usr/bin/python3.6() [0x506ac3]
frame #45: /usr/bin/python3.6() [0x507330]
frame #46: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #47: /usr/bin/python3.6() [0x5057d7]
frame #48: /usr/bin/python3.6() [0x506ac3]
frame #49: /usr/bin/python3.6() [0x507330]
frame #50: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #51: /usr/bin/python3.6() [0x504e80]
frame #52: /usr/bin/python3.6() [0x506ac3]
frame #53: /usr/bin/python3.6() [0x507330]
frame #54: _PyEval_EvalFrameDefault + 0x1548 (0x508f48 in /usr/bin/python3.6)
frame #55: /usr/bin/python3.6() [0x5064e4]
frame #56: /usr/bin/python3.6() [0x507330]
frame #57: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #58: /usr/bin/python3.6() [0x504e80]
frame #59: /usr/bin/python3.6() [0x5b9928]
frame #60: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
frame #61: _PyEval_EvalFrameDefault + 0x1ab3 (0x5094b3 in /usr/bin/python3.6)
frame #62: /usr/bin/python3.6() [0x504e80]
frame #63: /usr/bin/python3.6() [0x5b96f0]

PATH_TO_SCRIPT/train_faster_rcnn.py(78)<module>()
-> train_one_epoch(parallel_model, optimizer, train_dataloader, device, curr_epoch, 100)

The debugger ends in the module scatter_gather.py on line 13 in the nested function def scatter_map(obj): return Scatter.apply(target_gpus, None, dim, obj). At this point, obj is tensor(533958, device='cuda:0'). If I step up one level, the code is on line 15 in the same nested function def scatter_map(obj): return list(zip(*map(scatter_map, obj))). Here obj is ('image_id', tensor(533958, device='cuda:0')). "image_id" is a field in the target dictionary, which seems to indicate that the code is trying to scatter the fields of the target dictionary.

My bad. In the target dict, target["image_id"] needs to be torch.tensor([idx]). I was setting it to torch.tensor(idx).
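
To make the difference concrete (a quick check with the image_id value from the traceback):

import torch

torch.tensor(533958).dim()       # 0-dim tensor: this is what Scatter/chunk failed on
# torch.tensor(533958).chunk(2)  # -> RuntimeError: chunk expects at least a 1-dimensional tensor
torch.tensor([533958]).dim()     # 1
torch.tensor([533958]).chunk(2)  # (tensor([533958]),) -- scatters fine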

Good to hear you've found the bug. Is it working now?

Still need to thoroughly test it, but it seems to be working with torch.nn.parallel.DistributedDataParallel. Thanks for your help!