Multi GPU error in gather(). ValueError: gather got an input of invalid size: got 16x93x5, but expected 16x170x5

Hi, I’m trying to use multiple GPUs for the first time to train a model for machine reading comprehension and question answering, with LSTMs as the key component. The input is padded per batch, meaning that every sequence in a batch is padded to the length of the longest sequence in that batch.
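To be concrete, my per-batch padding works roughly like this (a toy sketch, not my actual preprocessing code):

```python
def pad_batch(seqs, pad_value=0):
    """Pad every sequence in the batch to the batch's own max length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

padded = pad_batch([[1, 2, 3], [4]])
# -> [[1, 2, 3], [4, 0, 0]]
```

So within any single batch, every sequence has the same length.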

The splitting part should not have any problem because of this padding, and I don’t think the gather part should have a problem either, since the inputs go through pack_padded_sequence and pad_packed_sequence before and after going through an LSTM. Yet I get the error below in the gather function and can’t figure out how to solve it. Has anyone run into this before and found a solution?
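My current suspicion, sketched without GPUs and with made-up sequence lengths: pad_packed_sequence pads each replica’s output only to the longest sequence *in that replica’s chunk*, so the two halves of the batch can come back with different sequence dimensions, which is exactly what gather() rejects.

```python
def padded_output_shape(lengths, feature_dim=5):
    """Shape of an LSTM output after pad_packed_sequence on one replica:
    the chunk is padded to the chunk's own max length, not the batch's."""
    return (len(lengths), max(lengths), feature_dim)

# Hypothetical lengths: a few long passages land in the first chunk.
batch_lengths = [170] * 3 + [93] * 29
chunk_a, chunk_b = batch_lengths[:16], batch_lengths[16:]  # scatter over 2 GPUs

shape_a = padded_output_shape(chunk_a)  # (16, 170, 5)
shape_b = padded_output_shape(chunk_b)  # (16, 93, 5)

# gather() requires every non-batch dimension to match, so this mismatch
# would produce exactly the "got 16x93x5, but expected 16x170x5" error.
assert shape_a[1:] != shape_b[1:]
```

If that’s the cause, the shapes only disagree when the longest sequences are unevenly distributed across the chunks, which would explain why it fails intermittently rather than on every batch.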

I’m using two GPUs with a batch size of 32, so each GPU gets a chunk of 16, matching the sizes in the error message. I’m on PyTorch 0.3.1. Has this been resolved in 0.4.0 by any chance? I haven’t found any documentation on this, though.


Traceback (most recent call last):
  File "scripts/train_selector.py", line 332, in <module>
    ex.run()
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/sacred/experiment.py", line 209, in run
    run()
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/sacred/run.py", line 221, in __call__
    self.result = self.main_function(*args)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/sacred/config/captured_function.py", line 46, in captured_function
    result = wrapped(*args, **kwargs)
  File "scripts/train_selector.py", line 285, in main
    train(epoch, selector_model, optimizer, train_data, args, config)
  File "scripts/train_selector.py", line 83, in train
    output, _ = model(passages[:2], passages[2], queries[:2], queries[2])
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 74, in forward
    return self.gather(outputs, self.output_device)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 86, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 65, in gather
    return gather_map(outputs)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 60, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/parallel/scatter_gather.py", line 57, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 55, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/imago/.conda/envs/mrcqa/lib/python3.5/site-packages/torch/cuda/comm.py", line 217, in gather
    "but expected {}".format(got, expected))
ValueError: gather got an input of invalid size: got 16x93x5, but expected 16x170x5