Running model on multiple GPUs RuntimeError: Caught RuntimeError in replica 0 on device 0

Hello, I’m trying to train my model on multiple GPUs, but I get a RuntimeError.

Here are the code snippets defining the zoo model:

class BasicStem(nn.Module):
	def __init__(self, in_channels=3, out_channels=64, norm="BN"):
		super().__init__()
		self.conv1 = nn.Conv2d(

	def forward(self, x):
		images = preprocess_image(x)
		print("x", images.tensor.size())
		x = self.conv1(images.tensor)
		x = F.relu_(x)
		x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
		return x

def preprocess_image(batched_inputs):
	"""Normalize, pad and batch the input images."""
	device = "cuda"
	size_divisibility = 0
	images = [x["image"].to(device) for x in batched_inputs]
	image_scales = [x["im_scale"] for x in batched_inputs]
	images = ImageList.from_tensors(images, size_divisibility)
	images.image_scales = image_scales
	return images

if __name__ == "__main__":
	cache_file = "data/dummy_data.pkl"
	if os.path.exists(cache_file):
		fid = open(cache_file, 'rb')
		blobs = cPickle.load(fid)
	data = [blobs, blobs, blobs]
	epoch = 10
	model = BasicStem().cuda()
	if torch.cuda.device_count() > 1:
		model = torch.nn.DataParallel(model)
	for i in range(epoch):

It works correctly if I run this code on a single GPU. Debug info on a single GPU:

x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])

After launching the code on 2 GPUs I got the following RuntimeError:

ser-v8:19723:19723 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 02 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 03 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Launch mode Group/CGMD
x torch.Size([3, 2, 600, 800])
x torch.Size([3, 1, 600, 800])
Traceback (most recent call last):
  File "tools/", line 83, in <module>
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/", line 85, in parallel_apply
  File "/zoo_dummy/lib/python3.6/site-packages/torch/", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/", line 60, in _worker
    output = module(*input, **kwargs)
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "tools/", line 50, in forward
    x = self.conv1(images.tensor)
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 64 3 7 7, expected input[3, 2, 600, 800] to have 3 channels, but got 2 channels instead

I also had a look at

My data is organized in NCHW format, with batch size = 3 and channel size = 3. As I understand it, the data should be scattered across the devices by slicing along the first dimension, so the batch should be split and I expected something like:

x torch.Size([1, 3, 600, 800])
x torch.Size([2, 3, 600, 800])

Instead, the data is unexpectedly sliced along the channel dimension, as shown:

x torch.Size([3, 2, 600, 800])
x torch.Size([3, 1, 600, 800])

This was tested under the following settings:

torch           1.4.0+cu100
torchvision     0.5.0+cu100
NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0
Tesla V100-SXM2
Ubuntu 18.04

I also tried increasing the batch size, but this does not change anything:

x torch.Size([8, 2, 600, 800])
x torch.Size([8, 1, 600, 800])

Any input would be appreciated.

I got this error solved.
The problem was that the batched data is not a tensor. It is a list of dicts (training samples plus ground truth). If the input to the model is a tensor organized in NCHW format, it works as expected.
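For anyone hitting the same thing, here is a minimal sketch of why the channels ended up split. As far as I understand it, DataParallel's scatter step recurses into lists and dicts and chunks every tensor it finds along dim 0. For a CHW image stored inside a dict, dim 0 is the channel dimension. The snippet only emulates that chunking on the CPU with torch.chunk, so it is an illustration under that assumption rather than the actual DataParallel code path.

```python
import torch

# Three samples as in the post: each image is a CHW tensor inside a dict.
batched_inputs = [
    {"image": torch.zeros(3, 600, 800), "im_scale": 1.0} for _ in range(3)
]

# Chunking a single CHW image along dim 0 across 2 devices splits the channels.
per_sample_chunks = torch.chunk(batched_inputs[0]["image"], 2, dim=0)
per_sample_shapes = [tuple(c.shape) for c in per_sample_chunks]
print(per_sample_shapes)

# Stacking the images first gives one NCHW tensor, so dim 0 is the batch
# dimension and chunking splits the samples instead of the channels.
batch = torch.stack([x["image"] for x in batched_inputs], dim=0)
batch_chunks = torch.chunk(batch, 2, dim=0)
batch_shapes = [tuple(c.shape) for c in batch_chunks]
print(batch_shapes)
```

The first print shows the 2-channel and 1-channel pieces seen in the error message; the second shows the batch-wise split that was expected.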

I’m still wondering whether it’s possible to pass a list of dict objects to a model wrapped in DataParallel.
Can the list-based batch somehow be scattered appropriately across multiple GPUs automatically?
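One direction I can think of is to split the list itself per device instead of letting the tensors inside it be chunked. Below is a rough sketch of that splitting step only. The helper name split_batch_for_devices is hypothetical and not part of PyTorch. Each sub-list could then be handed to one replica, for example from an overridden scatter method in a DataParallel subclass, but that wiring is not shown and I have not tested it.

```python
def split_batch_for_devices(batched_inputs, num_devices):
    """Split a list of per-sample dicts into contiguous sub-lists, one per device.

    Hypothetical helper for illustration only; not part of PyTorch.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be >= 1")
    if not batched_inputs:
        return []
    # Ceil division: earlier devices get the larger share of the samples.
    chunk_size = -(-len(batched_inputs) // num_devices)
    # The result may hold fewer than num_devices sub-lists for small batches.
    return [
        batched_inputs[i:i + chunk_size]
        for i in range(0, len(batched_inputs), chunk_size)
    ]


# Placeholder samples; real ones would hold image tensors and ground truth.
samples = [{"image": "img%d" % i, "im_scale": 1.0} for i in range(3)]
chunks = split_batch_for_devices(samples, 2)
print([len(c) for c in chunks])
```

Each dict stays intact, so the image inside it keeps all of its channels.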


Hi. I am also creating a batch as a list of dicts.
I am trying to use nn.DataParallel but get the same error. How did you solve the problem?

RuntimeError: Caught RuntimeError in replica 0 on device 0.

Hi @Yangmin, I’d suggest having a look at the class Instance in Detectron and at the Datasampler.

The former also takes a list of dictionaries as input to the reference model.
Basically, I guess you are missing the part that parallelizes your data across multiple devices. Those two classes should give you some ideas.

Your your_dataloader(dataset, distributed, epoch) function should look roughly as follows:

sampler = samplers.distributed.DistributedSampler(dataset=dataset, epoch=epoch) if distributed else None
shuffle = cfg.DATA_LOADER.SHUFFLE if sampler is None else False

loader =,
return loader
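To make that more concrete, here is a minimal sketch using PyTorch's own torch.utils.data.distributed.DistributedSampler instead of the project-specific samplers.distributed.DistributedSampler above. The dataset class, the collate function, make_loader and its batch_size, shuffle, num_replicas and rank parameters are all placeholders I made up for illustration. Passing num_replicas and rank explicitly lets the snippet run without initializing a process group; in real distributed training they would normally come from the process group.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


class DummyDetectionDataset(Dataset):
    """Hypothetical dataset returning one dict per sample."""

    def __init__(self, num_samples=8):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return {"image": torch.zeros(3, 32, 32), "im_scale": 1.0, "index": idx}


def keep_as_list(batch):
    # Keep the batch as a plain list of dicts instead of stacking tensors.
    return batch


def make_loader(dataset, distributed, epoch, batch_size=2, shuffle=True,
                num_replicas=None, rank=None):
    sampler = None
    if distributed:
        sampler = DistributedSampler(dataset, num_replicas=num_replicas, rank=rank)
        sampler.set_epoch(epoch)  # reshuffle differently for each epoch
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle if sampler is None else False,  # the sampler shuffles
        sampler=sampler,
        collate_fn=keep_as_list,
    )


# Emulate two processes (rank 0 and rank 1) in a single script.
dataset = DummyDetectionDataset(num_samples=8)
seen = []
for rank in range(2):
    loader = make_loader(dataset, distributed=True, epoch=0, num_replicas=2, rank=rank)
    for batch in loader:
        seen.append([sample["index"] for sample in batch])
print(seen)
```

Each rank gets its own disjoint share of the samples, and every batch is still a list of dicts that the model can preprocess on its own device.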

Hope it helps you to work it out.