Multi-GPU RuntimeError: Gather got an input of invalid size: got [1, 230], but expected [1, 231]

Vinsent_Paramananth1 · September 7, 2020, 12:18pm

RuntimeError: Gather got an input of invalid size: got [1, 230], but expected [1, 231]

File "main.py", line 517, in train
    fscore, fscore_epoch = ao.train(output_dir=hps.output_dir)
  File "main.py", line 328, in train
    y, _ = nn.DataParallel(self.model(seq, trg), device_ids=gpus, dim=0)  ## TODO  look how they are training the seq2seq model....
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [1, 230], but expected [1, 231] (gather at /opt/conda/conda-bld/pytorch_1579022051443/work/torch/csrc/cuda/comm.cpp:231)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f32e0f7d627 in /opt/conda/envs/py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x2ad (0x7f32ec0b476d in /opt/conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: <unknown function> + 0x9f7904 (0x7f3311def904 in /opt/conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x28c076 (0x7f3311684076 in /opt/conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0xa1f (0x7f3311a6de3f in /opt/conda/envs/py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Could someone point me to an explanation of this error?

ptrblck · September 10, 2020, 8:57am

Could you post a minimal executable code snippet to reproduce this issue, please?

vinsentds · September 11, 2020, 2:46am

Hi,
I will try to put together a minimal code example.
Regards,
Vinsent P.
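
While a reproduction is pending, here is a rough sketch of what the error message itself says. With dim=0, the gather step concatenates the per-GPU outputs along dimension 0, so every other dimension has to match the first output. Here one replica returned [1, 231] and another returned [1, 230], so they differ in dimension 1. That pattern would be consistent with a seq2seq model whose output length depends on the data each replica received, but that is an assumption, since the model code was not posted. The helper names below (find_gather_mismatch, pad_to_length) are hypothetical and work on plain shape tuples and lists, not on tensors.

```python
from typing import List, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def find_gather_mismatch(
    shapes: Sequence[Shape], dim: int = 0
) -> Optional[Tuple[int, Shape, Shape]]:
    """Return (index, got, expected) for the first shape that cannot be
    concatenated with shapes[0] along `dim`, or None if all are compatible.

    Hypothetical helper: it mirrors the rule stated in the error message,
    i.e. only the size along `dim` may differ between replica outputs.
    """
    first = shapes[0]
    for index, shape in enumerate(shapes[1:], start=1):
        expected = list(first)
        if len(shape) == len(first):
            # The gather dimension is allowed to differ per replica.
            expected[dim] = shape[dim]
        if list(shape) != expected:
            return index, tuple(shape), tuple(expected)
    return None


def pad_to_length(tokens: Sequence[int], length: int, pad_value: int = 0) -> List[int]:
    """Right-pad a sequence to a fixed length so every replica would
    produce the same size (assumes len(tokens) <= length)."""
    return list(tokens) + [pad_value] * (length - len(tokens))


# Shapes taken from the error message: replica 0 gave [1, 231], replica 1 gave [1, 230].
mismatch = find_gather_mismatch([(1, 231), (1, 230)], dim=0)

# After padding the shorter sequence to 231, the shapes are compatible.
padded = pad_to_length([7] * 230, 231)
fixed = find_gather_mismatch([(1, 231), (1, len(padded))], dim=0)
```

In a real run, the shape tuples could be collected from each replica's output (for example with tuple(output.shape)) just before the gather. If the lengths do differ per replica, padding the outputs to a fixed maximum length inside forward is one possible workaround, though whether it fits this model is untested here.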