Error using DataParallel with Multiple GPUs

I am getting the following error when trying to use multiple GPUs with DataParallel. Please note that the implementation works fine on a single GPU.
Here is the traceback:

Traceback (most recent call last):
  File "train.py", line 247, in <module>
    train_loss = train_xe(model, dataloader_train, optim, text_field)
  File "train.py", line 77, in train_xe
    out = model(detections, captions)
  File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 111, in replicate
    buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)
  File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 75, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: tensors.size() == order.size() INTERNAL ASSERT FAILED at /pytorch/torch/csrc/utils/tensor_flatten.cpp:66, please report a bug to PyTorch. (reorder_tensors_like at /pytorch/torch/csrc/utils/tensor_flatten.cpp:66)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f71080e7273 in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::utils::reorder_tensors_like(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10::ArrayRef<at::Tensor>) + 0x139f (0x7f710c34c9cf in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: torch::cuda::broadcast_coalesced(c10::ArrayRef<at::Tensor>, c10::ArrayRef, unsigned long) + 0x1d96 (0x7f710c834d76 in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x5f422c (0x7f71528fd22c in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x1d3ef4 (0x7f71524dcef4 in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #48: __libc_start_main + 0xf0 (0x7f7160f72830 in /lib/x86_64-linux-gnu/libc.so.6)

Thanks 🙂

Looks like DataParallel failed to replicate your model to multiple GPUs. Could you please share a minimal repro?
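
For reference, a minimal repro for this kind of replicate failure usually looks something like the sketch below. The toy module, the buffer name, and the tensor shapes are placeholders rather than code from the actual network; since the traceback fails while broadcasting buffers inside replicate(), the toy module registers a buffer, but on its own this sketch is expected to run without error.

import torch
import torch.nn as nn

# Minimal-repro sketch (hypothetical toy module, not the real network).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)
        # registered buffer, since replicate() broadcasts buffers across GPUs
        self.register_buffer('running_state', torch.zeros(4))

    def forward(self, x):
        return self.fc(x) + self.running_state

if __name__ == '__main__':
    assert torch.cuda.device_count() >= 2, "needs at least two GPUs"
    model = nn.DataParallel(ToyModel().cuda())      # replicates across all visible GPUs
    out = model(torch.randn(8, 16, device='cuda'))  # forward pass triggers replicate()
    print(out.shape)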

I am trying to parallelize this network.

After line 182, I just added:
model = torch.nn.DataParallel(model)

I have tried PyTorch 1.1.0, 1.2.0, 1.4.0, and 1.5 (built from source).
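
For context, that one added line sits inside the usual DataParallel wrapping pattern, roughly as sketched below. This is only a sketch: ToyNet, the device handling, and the input shapes are assumptions, not code from the linked repository; only the torch.nn.DataParallel(model) line mirrors the actual change.

import torch
import torch.nn as nn

# Sketch of the standard wrapping pattern (ToyNet stands in for the real network).
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2048, 512)

    def forward(self, detections, captions=None):
        return self.proj(detections)

device = torch.device('cuda')
model = ToyNet().to(device)               # parameters/buffers go to the GPU first
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)  # the added line: replicate on every forward
detections = torch.randn(4, 50, 2048, device=device)
out = model(detections)                   # DataParallel splits the batch across GPUs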