How to debug your code when using DataParallel

ray342659093 · October 26, 2017, 5:15am

Normally, once a model is wrapped in nn.DataParallel, the traceback no longer shows which line of my code raised the error. So how do you locate a bug in your code when using nn.DataParallel?

WERush · October 26, 2017, 6:02am

That is strange. I have not run into this. Can you post the full error message?

ray342659093 · October 26, 2017, 6:30am

Traceback (most recent call last):
  File "test_fusion.py", line 139, in <module>
    rst = eval_video((i, data, label))
  File "test_fusion.py", line 123, in eval_video
    rst = net(input_var)
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/…/generic/THCTensorMathPointwise.cu:217

This is the kind of error message I get. I have no idea which layer of my model raises it. If I do not use DataParallel, the traceback shows which line of my code produces the error.

Andrei_Pokrovsky · March 9, 2018, 12:49am

Try disabling DataParallel and debugging single-threaded first.
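A minimal sketch of that suggestion, not a definitive recipe. The model `TwoBranchNet` and the switch `USE_DATA_PARALLEL` are hypothetical names made up for illustration, and the shape bug is deliberate. With the wrapper turned off, the bare module runs in the main thread, so the traceback should pass through your own `forward` instead of stopping at `parallel_apply`.

```python
import traceback

import torch
import torch.nn as nn


class TwoBranchNet(nn.Module):
    """Hypothetical stand-in model with a deliberate shape bug."""

    def __init__(self):
        super().__init__()
        self.branch_a = nn.Linear(8, 4)
        self.branch_b = nn.Linear(8, 5)  # bug: should produce 4 features

    def forward(self, x):
        a = self.branch_a(x)
        b = self.branch_b(x)
        return a + b  # shapes (N, 4) and (N, 5) cannot be added


USE_DATA_PARALLEL = False  # hypothetical debug switch; turn back on once fixed

model = TwoBranchNet()
if USE_DATA_PARALLEL and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()

batch = torch.randn(2, 8)  # a tiny batch is enough to trigger a shape error
trace = ""
try:
    model(batch)
except RuntimeError:
    # The traceback now includes the frame of TwoBranchNet.forward.
    trace = traceback.format_exc()
    print(trace)
```

Once the error is fixed on the bare module, set the switch back to `True` and rerun with the full batch.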

arushi_019 · May 25, 2020, 12:19pm

I am also using nn.DataParallel and getting expected shape errors. In my case the model runs out of memory (OOM) when I run it on a single GPU or on the CPU, so I switched to nn.DataParallel, but now I cannot tell where the shape error comes from. How should I approach debugging in this case?
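One possible approach, sketched under the assumption that the shape error does not depend on the batch size: run the unwrapped model on a batch of one, which may fit in memory, and log the input shapes of every layer with forward pre-hooks. The helper `attach_shape_logger` and the small `nn.Sequential` model are hypothetical and only illustrate the idea. Because a pre-hook fires just before a layer runs, the last logged entry is the layer that failed. Keep in mind that nn.DataParallel splits inputs along dimension 0, so an input that is not batch-first can produce a shape error that this single-device run will not reproduce.

```python
import torch
import torch.nn as nn


def attach_shape_logger(model, log):
    """Append (layer name, input shapes) to `log` just before each leaf layer runs."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, log only leaf layers

        def pre_hook(mod, inputs, name=name):
            shapes = [tuple(t.shape) for t in inputs if torch.is_tensor(t)]
            log.append((name, shapes))

        handles.append(module.register_forward_pre_hook(pre_hook))
    return handles


# Hypothetical model with a deliberate mismatch between layers.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(30, 10),  # bug: expects 30 features but receives 32
)

log = []
handles = attach_shape_logger(model, log)
try:
    model(torch.randn(1, 16))  # batch of one keeps memory use low
except RuntimeError as err:
    failing_layer, input_shapes = log[-1]
    print("failed in layer", failing_layer, "with input shapes", input_shapes)
    print(err)
finally:
    for handle in handles:
        handle.remove()  # detach the hooks so normal runs are unaffected
```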