smadan
October 4, 2019, 2:36am
I am getting started with torch and trying to understand its object detection support using https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
I am having trouble parallelizing the Faster R-CNN model. I get the following error when running train_one_epoch in engine.py:
RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)
What am I doing wrong?
I parallelized using the following:
import torch
import torchvision
from torchvision.models.detection import FasterRCNN

# anchor_generator and roi_align are defined elsewhere (not shown)
backbone = torchvision.models.resnet50(pretrained=True)  # commented out torch.flatten
backbone.avgpool = torch.nn.Identity()
backbone.fc = torch.nn.Identity()
backbone.out_channels = 2048
model = FasterRCNN(backbone=backbone,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_align,
                   num_classes=81)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
parallel_model = torch.nn.DataParallel(model)  # need to first freeze batch norm
Debugging with pdb suggests that it is trying to scatter "image_id" in the target dictionary. I thought I was applying torch.nn.DataParallel in the wrong place, so I looked at https://github.com/pytorch/vision/blob/master/references/detection/train.py , but it seems fine.
I also tried torch.nn.parallel.DistributedDataParallel, without success, in the following way:
torch.distributed.init_process_group(backend='nccl', world_size=1, rank=0)
parallel_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[0])
python3 -m torch.distributed.launch --nproc_per_node=1 --use_env train_faster_rcnn.py
Could you post the shapes of the data you are passing to the model after wrapping it in nn.DataParallel?
smadan
October 4, 2019, 8:58pm
On line 27 of train_one_epoch in engine.py, just after
"for images, targets in metric_logger.log_every(data_loader, print_freq, header):"
(Pdb) p type(images)
<class 'tuple'>
(Pdb) p type(targets)
<class 'tuple'>
(Pdb) p images[0].shape
torch.Size([3, 640, 426])
(Pdb) p images[1].shape
torch.Size([3, 427, 640])
(Pdb) p type(targets[0])
<class 'dict'>
(Pdb) p targets[0].keys()
dict_keys(['boxes', 'labels', 'image_id', 'area', 'iscrowd'])
(Pdb) p targets[0]['boxes']
tensor([[  2.8600,   0.0000, 426.0000, 640.0000],
        [122.5200,   2.1600, 412.2500, 395.6800],
        [113.8600, 346.6700, 225.6200, 386.8300]])
I'm not sure if nn.DataParallel works with tuples (of tensors or dicts).
CC @fmassa, who might know how to properly use data parallel for segmentation models.
smadan
October 4, 2019, 10:10pm
Thanks. Not sure how to CC.
In https://github.com/pytorch/vision/blob/master/references/detection/train.py , the collate_fn passed to torch.utils.data.DataLoader seems to be the one in utils.py, which returns a tuple. The model in that file is parallelized with torch.nn.parallel.DistributedDataParallel (I understand it's different from torch.nn.DataParallel) and trained with train_one_epoch. I tried torch.nn.parallel.DistributedDataParallel in the way mentioned in the last part of the posted question, but was unsuccessful.
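To make the tuple structure concrete, here is a minimal sketch, assuming the collate_fn in utils.py simply transposes the batch (that is my reading of it, not something confirmed in this thread). The sample data is made up, and strings stand in for the real image tensors.

```python
def collate_fn(batch):
    """Transpose a list of (image, target) pairs into (images, targets)."""
    return tuple(zip(*batch))


# Stand-in samples: strings replace the real image tensors (assumption for illustration).
batch = [
    ("image_0", {"image_id": 0}),
    ("image_1", {"image_id": 1}),
]

images, targets = collate_fn(batch)
print(type(images).__name__, type(targets).__name__)
```

Because detection images differ in size, they cannot be stacked into one batch tensor, which is presumably why the batch is kept as tuples rather than a single tensor.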
Ah OK. What kind of error did you get using DDP?
smadan
October 4, 2019, 11:24pm
Here is the complete error with torch.nn.DataParallel. I replaced some directory names with "HOME" and "PATH_TO_SCRIPT".
RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3c89d6f273 in HOME/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: at::native::chunk(at::Tensor const&, long, long) + 0x2ff (0x7f3c1a3ac0cf in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: at::TypeDefault::chunk(at::Tensor const&, long, long) + 0x9 (0x7f3c1a6d4f99 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) + 0x29a (0x7f3c1bfbfcca in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<c10::cuda::CUDAStream>, std::allocator<c10::optional<c10::cuda::CUDAStream> > > > const&) + 0x3e1 (0x7f3c1ca938d1 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x5f3d8f (0x7f3c8afb6d8f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x1d3ef4 (0x7f3c8ab96ef4 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3.6() [0x50746c]
frame #8: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #9: /usr/bin/python3.6() [0x504e80]
frame #10: /usr/bin/python3.6() [0x506ac3]
frame #11: /usr/bin/python3.6() [0x507330]
frame #12: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #13: /usr/bin/python3.6() [0x504e80]
frame #14: /usr/bin/python3.6() [0x56cbbb]
frame #15: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
frame #16: THPFunction_apply(_object*, _object*) + 0x9df (0x7f3c8adb791f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #17: /usr/bin/python3.6() [0x507217]
frame #18: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #19: /usr/bin/python3.6() [0x5057d7]
frame #20: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #21: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #22: /usr/bin/python3.6() [0x5c377c]
frame #23: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
frame #24: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #25: /usr/bin/python3.6() [0x5057d7]
frame #26: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #27: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #28: /usr/bin/python3.6() [0x5c377c]
frame #29: PySequence_Tuple + 0x222 (0x5a1102 in /usr/bin/python3.6)
frame #30: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #31: /usr/bin/python3.6() [0x5057d7]
frame #32: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #33: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #34: /usr/bin/python3.6() [0x5c377c]
frame #35: PySequence_Tuple + 0x19b (0x5a107b in /usr/bin/python3.6)
frame #36: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #37: /usr/bin/python3.6() [0x5057d7]
frame #38: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
frame #39: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
frame #40: /usr/bin/python3.6() [0x5c377c]
frame #41: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
frame #42: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
frame #43: /usr/bin/python3.6() [0x5057d7]
frame #44: /usr/bin/python3.6() [0x506ac3]
frame #45: /usr/bin/python3.6() [0x507330]
frame #46: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #47: /usr/bin/python3.6() [0x5057d7]
frame #48: /usr/bin/python3.6() [0x506ac3]
frame #49: /usr/bin/python3.6() [0x507330]
frame #50: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #51: /usr/bin/python3.6() [0x504e80]
frame #52: /usr/bin/python3.6() [0x506ac3]
frame #53: /usr/bin/python3.6() [0x507330]
frame #54: _PyEval_EvalFrameDefault + 0x1548 (0x508f48 in /usr/bin/python3.6)
frame #55: /usr/bin/python3.6() [0x5064e4]
frame #56: /usr/bin/python3.6() [0x507330]
frame #57: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
frame #58: /usr/bin/python3.6() [0x504e80]
frame #59: /usr/bin/python3.6() [0x5b9928]
frame #60: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
frame #61: _PyEval_EvalFrameDefault + 0x1ab3 (0x5094b3 in /usr/bin/python3.6)
frame #62: /usr/bin/python3.6() [0x504e80]
frame #63: /usr/bin/python3.6() [0x5b96f0]
PATH_TO_SCRIPT/train_faster_rcnn.py(78)<module>()
-> train_one_epoch(parallel_model, optimizer, train_dataloader, device, curr_epoch, 100)
smadan
October 4, 2019, 11:26pm
The debugger ends in scatter_gather.py on line 13, inside the nested function scatter_map(obj), at: return Scatter.apply(target_gpus, None, dim, obj). At this point, obj is tensor(533958, device='cuda:0'). If I step up one level, the code is on line 15 of the same function: return list(zip(*map(scatter_map, obj))). Here obj is ('image_id', tensor(533958, device='cuda:0')). "image_id" is a field in the target dictionary, so the code appears to be scattering each field of the target dictionary.
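Since the error says chunk needs at least a 1-dimensional tensor, one way to locate the offending field is to walk the batch the same way scatter does and list every tensor whose dim() is 0. The following is only a sketch: find_zero_dim and FakeTensor are hypothetical names, and the check is duck-typed on dim() so the example runs without torch installed. With real data, torch.Tensor objects would be matched the same way.

```python
def find_zero_dim(obj, path="batch"):
    """Yield the path of every tensor-like leaf whose dim() is 0."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_zero_dim(value, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            yield from find_zero_dim(value, f"{path}[{i}]")
    elif callable(getattr(obj, "dim", None)) and obj.dim() == 0:
        yield path


class FakeTensor:
    """Hypothetical stand-in exposing only dim(), like torch.Tensor does."""

    def __init__(self, ndim):
        self._ndim = ndim

    def dim(self):
        return self._ndim


# Two targets: the first has a 0-dim image_id, the second a 1-dim one.
targets = (
    {"boxes": FakeTensor(2), "image_id": FakeTensor(0)},
    {"boxes": FakeTensor(2), "image_id": FakeTensor(1)},
)
bad_paths = list(find_zero_dim(targets, "targets"))
print(bad_paths)
```

Running such a check on one batch from the data loader before wrapping the model avoids having to step through the scatter code in the debugger.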
smadan
October 5, 2019, 3:06am
My bad. In the target dict, target["image_id"] needs to be torch.tensor([idx]), which is 1-dimensional. I was setting it to torch.tensor(idx), which is 0-dimensional and therefore cannot be chunked.
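A small sketch of the difference, assuming a working torch install. The idx value is the image id from the debugger output above, and the chunk count of 2 is arbitrary. chunk is called directly here because it is the call that appears at the top of the stack trace.

```python
import torch

idx = 533958  # image id seen in the debugger output

scalar_id = torch.tensor(idx)    # 0-dimensional, shape torch.Size([])
vector_id = torch.tensor([idx])  # 1-dimensional, shape torch.Size([1])
print(scalar_id.dim(), vector_id.dim())

# Splitting along dim 0 is impossible for a tensor with no dimensions.
scalar_failed = False
try:
    scalar_id.chunk(2, dim=0)
except RuntimeError as err:
    scalar_failed = True
    print("0-dim tensor:", err)

# A 1-dimensional tensor can be chunked, even when it holds a single element.
chunks = vector_id.chunk(2, dim=0)
print(len(chunks), chunks[0].shape)
```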
ptrblck
October 5, 2019, 10:03am
Good to hear you've found the bug. Is it working now?
smadan
October 5, 2019, 4:08pm
Still need to thoroughly test it, but it seems to be working with torch.nn.parallel.DistributedDataParallel. Thanks for your help!