Hello, I'm trying to train my model on multiple GPUs and I get a RuntimeError. Here are the code snippets defining the zoo model:
import os
import gzip
import pickle as cPickle  # plain pickle stands in for cPickle on Python 3

import torch
import torch.nn as nn
import torch.nn.functional as F
from detectron2.structures import ImageList  # assumed origin of ImageList.from_tensors used below


class BasicStem(nn.Module):
    def __init__(self, in_channels=3, out_channels=64, norm="BN"):
        super().__init__()
        # 7x7 stride-2 stem convolution; expects a 3-channel NCHW input
        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=7,
            stride=2,
            padding=3,
            bias=False,
        )

    def forward(self, x):
        images = preprocess_image(x)
        print("x", images.tensor.size())
        x = self.conv1(images.tensor)
        x = F.relu_(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
        return x


def preprocess_image(batched_inputs):
    """
    Normalize, pad and batch the input images.
    """
    device = "cuda"
    size_divisibility = 0
    images = [x["image"].to(device) for x in batched_inputs]
    image_scales = [x["im_scale"] for x in batched_inputs]
    images = ImageList.from_tensors(images, size_divisibility)
    images.image_scales = image_scales
    return images


if __name__ == "__main__":
    cache_file = "data/dummy_data.pkl"
    if os.path.exists(cache_file):
        fid = gzip.open(cache_file, 'rb')
        blobs = cPickle.load(fid)
        fid.close()
    data = [blobs, blobs, blobs]
    epoch = 10
    model = BasicStem().cuda()
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
    for i in range(epoch):
        model.forward(data)
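Since data/dummy_data.pkl cannot be attached, here is roughly what each blobs entry looks like as far as preprocess_image is concerned (the values and dtype are placeholders of mine; only the keys and the image shape come from the code and the single-GPU debug output below):

import torch

# Assumed stand-in for one entry of data/dummy_data.pkl: a dict holding one
# CHW image tensor plus its resize scale (values are made up; the keys and the
# [3, 600, 800] shape are taken from preprocess_image and the debug print below).
blobs = {
    "image": torch.rand(3, 600, 800),
    "im_scale": 1.0,
}
data = [blobs, blobs, blobs]  # batch of 3, as in the snippet above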
It works correctly if I run this code on a single GPU. Debug info on a single GPU:
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
x torch.Size([3, 3, 600, 800])
After launching the code snippet on 2 GPUs, I get the following RuntimeError:
ser-v8:19723:19723 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 02 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/direct pointer
ser-v8:19723:19723 [1] NCCL INFO Ring 03 : 1[1] -> 0[0] via P2P/direct pointer
ser-v8:19723:19723 [0] NCCL INFO Launch mode Group/CGMD
x torch.Size([3, 2, 600, 800])
x torch.Size([3, 1, 600, 800])
Traceback (most recent call last):
File "tools/zoo.py", line 83, in <module>
model.forward(data)
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/zoo_dummy/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "tools/zoo.py", line 50, in forward
x = self.conv1(images.tensor)
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/zoo_dummy/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size 64 3 7 7, expected input[3, 2, 600, 800] to have 3 channels, but got 2 channels instead
I also had a look at https://discuss.pytorch.org/t/runtimeerror-given-groups-1-weight-of-size-64-3-7-7-expected-input-3-1-224-224-to-have-3-channels-but-got-1-channels-instead/30153/9 and https://discuss.pytorch.org/t/run-pytorch-on-multiple-gpus/20932.
My data is organized in NCHW format with batch size = 3 and channel size = 3. If I understand correctly, as shown in https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html#create-model-and-dataparallel, DataParallel should scatter the data across multiple devices by slicing the first (batch) dimension, so I would expect the batch to be split like this:
x torch.Size([1, 3, 600, 800])
x torch.Size([2, 3, 600, 800])
Instead, the channel dimension is sliced unexpectedly:
x torch.Size([3, 2, 600, 800])
x torch.Size([3, 1, 600, 800])
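For reference, this is a minimal toy sketch of the splitting behaviour I expected (the toy module and names are mine, not from the actual code): when DataParallel is given a plain batched tensor, each replica should receive a slice along dim 0.

import torch
import torch.nn as nn

class PrintShape(nn.Module):
    def forward(self, x):
        # each replica should see a slice of the batch (first) dimension
        print("replica input:", x.size())
        return x

if torch.cuda.device_count() > 1:
    toy = nn.DataParallel(PrintShape()).cuda()
    batch = torch.rand(3, 3, 600, 800).cuda()  # NCHW, batch size 3
    toy(batch)
    # on 2 GPUs I would expect prints like (order may vary):
    # replica input: torch.Size([2, 3, 600, 800])
    # replica input: torch.Size([1, 3, 600, 800])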
Results were obtained with the following setup:
torch 1.4.0+cu100
torchvision 0.5.0+cu100
NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0
Tesla V100-SXM2
Ubuntu 18.04
I also tried increasing the batch size to 8, but that does not change anything:
x torch.Size([8, 2, 600, 800])
x torch.Size([8, 1, 600, 800])
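In case it helps with the diagnosis, the next check I can think of is to call the scatter routine directly on the same `data` list from the snippet at the top and print what each device would receive (assuming torch.nn.parallel.scatter with its default dim=0 is what DataParallel applies to the inputs; the device ids [0, 1] match my 2-GPU setup):

from torch.nn.parallel import scatter

# split the same `data` list across the two visible devices and inspect
# what each replica would actually receive
per_device = scatter(data, [0, 1])
for idx, chunk in enumerate(per_device):
    print("device", idx, "receives", len(chunk), "dict(s),",
          "first image size:", chunk[0]["image"].size())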
Any input would be appreciated.