Hi all, I'm hitting a strange error in my multi-GPU training.
Here is the traceback:
"
File “/python/envs/lib/python2.7/site-packages/torch/autograd/variable.py”, line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File “/python/envs/lib/python2.7/site-packages/torch/autograd/init.py”, line 99, in backward
variables, grad_variables, retain_graph)
File “/python/envs/lib/python2.7/site-packages/torch/autograd/function.py”, line 91, in apply
return self._forward_cls.backward(self, *args)
File “/python/envs/lib/python2.7/site-packages/torch/nn/parallel/_functions.py”, line 59, in backward
return (None, None) + Scatter.apply(ctx.input_gpus, ctx.input_sizes, ctx.dim, grad_output)
File “/python/envs/lib/python2.7/site-packages/torch/nn/parallel/_functions.py”, line 74, in forward
outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
File “/python/envs/lib/python2.7/site-packages/torch/cuda/comm.py”, line 178, in scatter
“expected {})”.format(sum(chunk_sizes), tensor.size(dim))
AssertionError: given chunk sizes don’t sum up to the tensor’s size (sum(chunk_sizes) == 3, but expected 4)
"
I use 4 GPUs, so I understand why it said "expected 4". What is strange is that the error occurs in the middle of otherwise normal training: training runs fine for a while, then hits this error.
What's more, when I changed the batch size from 16 to 8 (still on 4 GPUs), the error no longer appears.
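For context, here is a minimal sketch (plain Python, no torch or GPUs needed) of how an even-chunking scheme, similar to what scatter does by default, splits a batch across devices. The `chunk_sizes` helper is hypothetical, written only for illustration, and the exact chunking rule is an assumption, not a claim about torch's implementation. It shows how a short final batch (e.g. the last, incomplete batch of an epoch) can produce fewer chunks than there are GPUs, so the chunk sizes no longer sum to what the other side of the scatter/gather expects:

```python
import math

def chunk_sizes(batch_size, num_devices):
    """Hypothetical even-chunking rule: split batch_size into chunks of
    ceil(batch_size / num_devices) samples; a short batch yields fewer
    chunks than there are devices."""
    chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes

print(chunk_sizes(16, 4))  # [4, 4, 4, 4] -- every GPU gets a full chunk
print(chunk_sizes(3, 4))   # [1, 1, 1]  -- only 3 chunks; sum is 3, not 4
```

Under this assumption, a final batch of 3 samples on 4 GPUs would produce chunk sizes summing to 3 where 4 is expected, which matches the assertion message above; changing the batch size changes how the dataset divides, which could explain why the error disappeared.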
Thanks for any help!