Hi all, I'm hitting a strange error in my multi-GPU training.
Here is the traceback:
"
File “/python/envs/lib/python2.7/site-packages/torch/autograd/variable.py”, line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File “/python/envs/lib/python2.7/site-packages/torch/autograd/init.py”, line 99, in backward
variables, grad_variables, retain_graph)
File “/python/envs/lib/python2.7/site-packages/torch/autograd/function.py”, line 91, in apply
return self._forward_cls.backward(self, *args)
File “/python/envs/lib/python2.7/site-packages/torch/nn/parallel/_functions.py”, line 59, in backward
return (None, None) + Scatter.apply(ctx.input_gpus, ctx.input_sizes, ctx.dim, grad_output)
File “/python/envs/lib/python2.7/site-packages/torch/nn/parallel/_functions.py”, line 74, in forward
outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
File “/python/envs/lib/python2.7/site-packages/torch/cuda/comm.py”, line 178, in scatter
“expected {})”.format(sum(chunk_sizes), tensor.size(dim))
AssertionError: given chunk sizes don’t sum up to the tensor’s size (sum(chunk_sizes) == 3, but expected 4)
"
I use 4 GPUs, so I understand why it said "expected 4". What is strange is that the error occurs in the middle of otherwise normal training: training runs fine for a while, then hits this error.
What's more, when I changed the batch size from 16 to 8 (still on 4 GPUs), the error no longer appears.
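For context, here is a minimal sketch (plain Python, no torch or GPUs needed) of how an even-chunking scheme, similar to what scatter does by default, splits a batch across devices. The `chunk_sizes` helper is hypothetical, written only for illustration, and the exact chunking rule is an assumption, not a claim about torch's implementation. It shows how a short final batch (e.g. the last, incomplete batch of an epoch) can produce fewer chunks than there are GPUs, so the chunk sizes no longer sum to what the other side of the scatter/gather expects:

```python
import math

def chunk_sizes(batch_size, num_devices):
    """Hypothetical even-chunking rule: split batch_size into chunks of
    ceil(batch_size / num_devices) samples; a short batch yields fewer
    chunks than there are devices."""
    chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes

print(chunk_sizes(16, 4))  # [4, 4, 4, 4] -- every GPU gets a full chunk
print(chunk_sizes(3, 4))   # [1, 1, 1]  -- only 3 chunks; sum is 3, not 4
```

Under this assumption, a final batch of 3 samples on 4 GPUs would produce chunk sizes summing to 3 where 4 is expected, which matches the assertion message above; changing the batch size changes how the dataset divides, which could explain why the error disappeared.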
Thanks for any help!