smadan  
                
                  
                    October 4, 2019,  2:36am
                   
                  1 
               
             
            
              I am getting started with torch and am trying to understand its object detection support using https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html 
I am having trouble parallelizing the Faster R-CNN model. I get the following error when running train_one_epoch in engine.py:
RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)
What am I doing wrong?
I parallelized using the following,
backbone = torchvision.models.resnet50(pretrained=True)   # commented out torch.flatten
model = FasterRCNN(backbone=backbone,
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
Debugging with pdb suggests that it is trying to scatter "image_id" in the target dictionary. I thought I was applying torch.nn.DataParallel in the wrong place, so I looked at https://github.com/pytorch/vision/blob/master/references/detection/train.py , but my usage seems fine.
I also unsuccessfully tried using torch.nn.parallel.DistributedDataParallel in the following way,
             
            
            
            
           
          
            
            
              Could you post the shapes of the data you are passing to the model after wrapping it in nn.DataParallel?
             
            
              
            
           
          
            
              
                smadan  
              
                  
                    October 4, 2019,  8:58pm
                   
                  3 
               
             
            
              On line 27 in “train_one_epoch” in “engine.py”, just after,
“for images, targets in metric_logger.log_every(data_loader, print_freq, header):”
(Pdb) p type(images)
             
            
              
            
           
          
            
            
              I'm not sure whether nn.DataParallel works with tuples (of tensors or dicts). CC @fmassa, who might know how to properly use data parallel for segmentation models.
             
            
              
            
           
          
            
              
                smadan  
              
                  
                    October 4, 2019, 10:10pm
                   
                  5 
               
             
            
              Thanks. Not sure how to CC.
In https://github.com/pytorch/vision/blob/master/references/detection/train.py , the collate_fn for torch.utils.data.DataLoader seems to be the one in "utils.py", which returns a tuple. The model in that file gets parallelized with torch.nn.parallel.DistributedDataParallel (I understand it's different from torch.nn.DataParallel) and trained with "train_one_epoch". I tried torch.nn.parallel.DistributedDataParallel in the way mentioned in the last part of the posted question, but was unsuccessful.
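For context, the collate function in the reference "utils.py" is, as far as I can tell, a one-line transpose of the batch, which is why images and targets arrive as tuples instead of stacked tensors. A minimal sketch, with plain strings and dicts standing in for image tensors and target dicts (the sample data is hypothetical):

```python
def collate_fn(batch):
    # Turn a list of (image, target) pairs into (images, targets),
    # where each of the two results is a tuple with one entry per sample.
    return tuple(zip(*batch))


# Hypothetical stand-ins for (image tensor, target dict) samples.
batch = [("img0", {"image_id": 0}), ("img1", {"image_id": 1})]
images, targets = collate_fn(batch)
```

Because nothing is stacked, images of different sizes can share a batch, but any code that later splits the batch has to walk the tuples and dicts itself.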
             
            
              
            
           
          
            
            
              Ah OK. What kind of error did you get using DDP?
             
            
              
            
           
          
            
              
                smadan  
              
                  
                    October 4, 2019, 11:24pm
                   
                  7 
               
             
            
              Complete error with torch.nn.DataParallel. I replaced some directory names with “HOME” and “PATH_TO_SCRIPT”.
RuntimeError: chunk expects at least a 1-dimensional tensor (chunk at /pytorch/aten/src/ATen/native/TensorShape.cpp:188)
#0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3c89d6f273 in HOME/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
#1: at::native::chunk(at::Tensor const&, long, long) + 0x2ff (0x7f3c1a3ac0cf in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
#2: at::TypeDefault::chunk(at::Tensor const&, long, long) + 0x9 (0x7f3c1a6d4f99 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
#3: torch::autograd::VariableType::chunk(at::Tensor const&, long, long) + 0x29a (0x7f3c1bfbfcca in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
#4: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef, c10::optional<std::vector<long, std::allocator > > const&, long, c10::optional<std::vector<c10::optionalc10::cuda::CUDAStream , std::allocator<c10::optionalc10::cuda::CUDAStream  > > > const&) + 0x3e1 (0x7f3c1ca938d1 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
#5:  + 0x5f3d8f (0x7f3c8afb6d8f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
#6:  + 0x1d3ef4 (0x7f3c8ab96ef4 in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
#7: /usr/bin/python3.6() [0x50746c]
#8: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
#9: /usr/bin/python3.6() [0x504e80]
#10: /usr/bin/python3.6() [0x506ac3]
#11: /usr/bin/python3.6() [0x507330]
#12: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
#13: /usr/bin/python3.6() [0x504e80]
#14: /usr/bin/python3.6() [0x56cbbb]
#15: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
#16: THPFunction_apply(_object*, _object*) + 0x9df (0x7f3c8adb791f in HOME/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
#17: /usr/bin/python3.6() [0x507217]
#18: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
#19: /usr/bin/python3.6() [0x5057d7]
#20: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
#21: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
#22: /usr/bin/python3.6() [0x5c377c]
#23: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
#24: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
#25: /usr/bin/python3.6() [0x5057d7]
#26: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
#27: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
#28: /usr/bin/python3.6() [0x5c377c]
#29: PySequence_Tuple + 0x222 (0x5a1102 in /usr/bin/python3.6)
#30: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
#31: /usr/bin/python3.6() [0x5057d7]
#32: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
#33: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
#34: /usr/bin/python3.6() [0x5c377c]
#35: PySequence_Tuple + 0x19b (0x5a107b in /usr/bin/python3.6)
#36: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
#37: /usr/bin/python3.6() [0x5057d7]
#38: _PyFunction_FastCallDict + 0xe8 (0x4531d8 in /usr/bin/python3.6)
#39: _PyObject_FastCallDict + 0x291 (0x59f051 in /usr/bin/python3.6)
#40: /usr/bin/python3.6() [0x5c377c]
#41: PySequence_Tuple + 0x1fc (0x5a10dc in /usr/bin/python3.6)
#42: _PyEval_EvalFrameDefault + 0x5d41 (0x50d741 in /usr/bin/python3.6)
#43: /usr/bin/python3.6() [0x5057d7]
#44: /usr/bin/python3.6() [0x506ac3]
#45: /usr/bin/python3.6() [0x507330]
#46: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
#47: /usr/bin/python3.6() [0x5057d7]
#48: /usr/bin/python3.6() [0x506ac3]
#49: /usr/bin/python3.6() [0x507330]
#50: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
#51: /usr/bin/python3.6() [0x504e80]
#52: /usr/bin/python3.6() [0x506ac3]
#53: /usr/bin/python3.6() [0x507330]
#54: _PyEval_EvalFrameDefault + 0x1548 (0x508f48 in /usr/bin/python3.6)
#55: /usr/bin/python3.6() [0x5064e4]
#56: /usr/bin/python3.6() [0x507330]
#57: _PyEval_EvalFrameDefault + 0x4dd (0x507edd in /usr/bin/python3.6)
#58: /usr/bin/python3.6() [0x504e80]
#59: /usr/bin/python3.6() [0x5b9928]
#60: PyObject_Call + 0x3e (0x59fcee in /usr/bin/python3.6)
#61: _PyEval_EvalFrameDefault + 0x1ab3 (0x5094b3 in /usr/bin/python3.6)
#62: /usr/bin/python3.6() [0x504e80]
#63: /usr/bin/python3.6() [0x5b96f0]
PATH_TO_SCRIPT/train_faster_rcnn.py(78)()
 
             
            
              
            
           
          
            
              
                smadan  
              
                  
                    October 4, 2019, 11:26pm
                   
                  8 
               
             
            
              The debugger ends in the module "scatter_gather.py" on line 13, in the nested function "def scatter_map(obj)": return Scatter.apply(target_gpus, None, dim, obj). At this point, obj is "tensor(533958, device='cuda:0')". If I step up one level, the code is on line 15 in the same nested function "def scatter_map(obj)": return list(zip(*map(scatter_map, obj))). Here "obj" is ('image_id', tensor(533958, device='cuda:0')). "image_id" is a field in the target dictionary, which seems to indicate that the code is trying to scatter the fields in the target dictionary.
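The recursion described above can be modeled roughly as follows. This is a simplified, CPU-only sketch and not the actual scatter_gather.py code: the name scatter_sketch is hypothetical, and Tensor.chunk stands in for Scatter.apply, which on GPUs also copies each piece to its target device.

```python
import torch


def scatter_sketch(obj, num_chunks, dim=0):
    """Simplified model of how DataParallel's scatter walks nested inputs."""
    if isinstance(obj, torch.Tensor):
        # Tensors are split along `dim`; a 0-dim tensor cannot be split
        # and raises the RuntimeError seen in the traceback.
        return list(obj.chunk(num_chunks, dim))
    if isinstance(obj, (tuple, list)) and len(obj) > 0:
        # Scatter every element, then regroup the pieces per chunk.
        parts = [scatter_sketch(o, num_chunks, dim) for o in obj]
        return [type(obj)(items) for items in zip(*parts)]
    if isinstance(obj, dict) and len(obj) > 0:
        # Each (key, value) pair is scattered as a tuple, so values such
        # as target["image_id"] reach the tensor branch above.
        parts = [scatter_sketch(kv, num_chunks, dim) for kv in obj.items()]
        return [dict(items) for items in zip(*parts)]
    # Anything else (for example the key string) is replicated.
    return [obj for _ in range(num_chunks)]


# A 1-dim "image_id" with two entries splits cleanly into two dicts.
pieces = scatter_sketch({"image_id": torch.tensor([7, 8])}, num_chunks=2)
```

Under this model, every tensor nested anywhere in the targets must have at least one dimension, because each one ends up in the tensor branch.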
             
            
              
            
           
          
            
              
                smadan  
              
                  
                    October 5, 2019,  3:06am
                   
                  9 
               
             
            
              My bad. In the target dict, target["image_id"] needs to be a 1-dimensional tensor, torch.tensor([idx]). I was setting it to the 0-dimensional torch.tensor(idx).
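The difference between the two forms can be checked on CPU without any GPUs. A small sketch, where idx is just the value from the debugger output above and can_chunk is a hypothetical helper:

```python
import torch

idx = 533958

bad = torch.tensor(idx)     # 0-dimensional: shape torch.Size([])
good = torch.tensor([idx])  # 1-dimensional: shape torch.Size([1])


def can_chunk(t, n=2):
    """Return True if the tensor can be split with Tensor.chunk."""
    try:
        t.chunk(n)
        return True
    except RuntimeError:
        # "chunk expects at least a 1-dimensional tensor"
        return False


print(bad.dim(), can_chunk(bad))
print(good.dim(), can_chunk(good))
```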
             
            
              
            
           
          
            
              
                ptrblck  
              
                  
                    October 5, 2019, 10:03am
                   
                  10 
               
             
            
              Good to hear you've found the bug. Is it working now?
             
            
              
            
           
          
            
              
                smadan  
              
                  
                    October 5, 2019,  4:08pm
                   
                  11 
               
             
            
              Still need to thoroughly test it, but it seems to be working with torch.nn.parallel.DistributedDataParallel. Thanks for your help!