Normally, after wrapping a model in nn.DataParallel, the traceback no longer shows which line raised the error. How do you locate the bug in your code when using nn.DataParallel?
That is strange; I have not run into this. Can you post the detailed error information?
Traceback (most recent call last):
  File "test_fusion.py", line 139, in <module>
    rst = eval_video((i, data, label))
  File "test_fusion.py", line 123, in eval_video
    rst = net(input_var)
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/rusu5516/.local/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/…/generic/THCTensorMathPointwise.cu:217
This is the kind of error message I get. I have no idea which layer of my model raises it. Without DataParallel, the traceback shows which line of my code produces the error.
Try disabling DataParallel and debugging single-threaded first.
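Here is a minimal sketch of that suggestion, assuming the model object is reachable before the forward pass. `Fusion` and `unwrap` are hypothetical names used only for illustration (they are not from test_fusion.py), and the shape bug is planted on purpose. The idea is to pull the original module out of the wrapper via its `.module` attribute and call it directly on a tiny batch, so the traceback runs through your own `forward`.

```python
import traceback

import torch
import torch.nn as nn


class Fusion(nn.Module):
    """Hypothetical stand-in model with a deliberate shape bug."""

    def __init__(self):
        super().__init__()
        self.branch_a = nn.Linear(8, 4)
        self.branch_b = nn.Linear(8, 5)  # bug: should produce 4 features

    def forward(self, x):
        a = self.branch_a(x)
        b = self.branch_b(x)
        return a + b  # (2, 4) + (2, 5): sizes do not match


def unwrap(model):
    """Return the wrapped module if model is an nn.DataParallel, else model."""
    return model.module if isinstance(model, nn.DataParallel) else model


net = nn.DataParallel(Fusion())  # what the script would normally build
plain = unwrap(net).cpu()        # bare module, no replication across devices

report = ""
try:
    plain(torch.randn(2, 8))     # tiny batch, runs in the main thread
except RuntimeError:
    report = traceback.format_exc()
    print(report)                # the frames now include Fusion.forward
```

Once the failing line is found and fixed, the model can be wrapped in nn.DataParallel again unchanged.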
I am also using nn.DataParallel and getting expected-shape errors. In my case, the model runs out of memory (OOM) when I run it on a single GPU or on the CPU, so I switched to nn.DataParallel, but now I cannot tell where the shape error is coming from. How should I approach this case?
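One possible approach, sketched below under two assumptions: that the OOM comes mostly from batch size and autograd buffers, so a single sample under `torch.no_grad()` fits in memory, and that the first input of every submodule is a tensor. The helper `trace_shapes` and the toy `nn.Sequential` model are hypothetical. Forward pre-hooks record each submodule's name and input shape as it is entered, so the last log entry names the layer that failed.

```python
import torch
import torch.nn as nn


def trace_shapes(model, log):
    """Attach forward pre-hooks that append (module name, input shape) to log."""
    handles = []
    for name, module in model.named_modules():
        if name == "":
            continue  # skip the root module itself

        def hook(mod, inputs, name=name):
            # Assumes the first positional input is a tensor.
            log.append((name, tuple(inputs[0].shape)))

        handles.append(module.register_forward_pre_hook(hook))
    return handles


# Hypothetical stand-in model: the last layer expects 16 features but gets 4.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(16, 2))
log = []
handles = trace_shapes(model, log)
try:
    with torch.no_grad():         # no autograd buffers, so less memory
        model(torch.randn(1, 8))  # a single sample instead of a full batch
except RuntimeError as err:
    print("failed inside:", log[-1], "|", err)
finally:
    for h in handles:
        h.remove()                # detach the hooks after debugging
```

If even one sample does not fit, the same hooks could be left in place while running under nn.DataParallel, though I have not verified how the log interleaves across replicas.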