Data distribution issue when training in parallel with DataParallel

I am trying to train my CNN on multiple GPUs using nn.DataParallel. However, I've encountered the following issue: when the input data [a1, a2, a3, a4] is distributed to two GPUs (gpu:0 and gpu:1), I expected the GPUs to receive [a1, a2] and [a3, a4], respectively. In reality, I'm getting [a1, a2] on gpu:0 and zeros (or data from an unknown source) on gpu:1.
Original input:
tensor([[1., 0., 0., 1.],
        [0., 0., 1., 1.],
        [0., 1., 0., 1.],
        [0., 1., 1., 0.]], device='cuda:0', dtype=torch.float64)
I used a forward hook to obtain and print the data that flows into each parallel GPU:
tensor([[1., 0., 0., 1.],
        [0., 0., 1., 1.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], device='cuda:0', dtype=torch.float64)
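For context, this is roughly the hook pattern I mean (a minimal, CPU-runnable sketch; the model and layer are illustrative placeholders, not my actual network). With nn.DataParallel on two GPUs, the hook would fire once per replica and each call would see that replica's shard of the batch; on a single device it simply sees the whole batch:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the real model.
model = nn.Linear(4, 2).double()

captured = []

def capture_input(module, inputs, output):
    # inputs is the tuple of tensors passed to forward();
    # under DataParallel each replica's hook receives its own shard.
    captured.append(inputs[0].detach().clone())

handle = model.register_forward_hook(capture_input)

x = torch.tensor([[1., 0., 0., 1.],
                  [0., 0., 1., 1.],
                  [0., 1., 0., 1.],
                  [0., 1., 1., 0.]], dtype=torch.float64)

# On two GPUs you would wrap: model = nn.DataParallel(model).cuda()
model(x)
handle.remove()
print(captured[0])
```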
Can someone tell me why this happens?
Postscript: the training data for the network is generated on the fly from the latest model, so there isn't a pre-existing dataset.

Another question: how can I obtain the output of an intermediate layer when the model is trained on multiple GPUs?
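One common approach (sketched below under illustrative names; the model and layer index are placeholders) is again a forward hook, but keyed by the output's device, since under DataParallel the hook fires once per replica and each replica's output lives on a different GPU:

```python
import torch
import torch.nn as nn

# Illustrative model; hook the layer whose output you want (index 0 here).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).double()

# device string -> list of captured intermediate outputs
intermediate = {}

def save_output(module, inputs, output):
    # Move to CPU so results from all replicas can be inspected together.
    intermediate.setdefault(str(output.device), []).append(output.detach().cpu())

handle = model[0].register_forward_hook(save_output)

x = torch.randn(4, 4, dtype=torch.float64)
# On multiple GPUs you would wrap: model = nn.DataParallel(model).cuda()
model(x)
handle.remove()

for device, outs in intermediate.items():
    print(device, [tuple(t.shape) for t in outs])
```

On a single CPU device this records one (4, 8) tensor under the key 'cpu'; on two GPUs you would instead get entries for 'cuda:0' and 'cuda:1', each holding that replica's half of the batch.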

Based on the issue, your system might have trouble with the peer-to-peer (P2P) connectivity between your GPUs.
You could check the NCCL FAQ and see whether IOMMU might need to be disabled.
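Before digging into NCCL settings, you could also query PyTorch directly for P2P capability between the visible GPUs (a small sketch; it needs at least two visible GPUs to report anything, and prints a notice otherwise):

```python
import torch

def p2p_matrix():
    """Return a dict mapping (src, dst) GPU index pairs to peer-access support."""
    n = torch.cuda.device_count()
    access = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                access[(i, j)] = torch.cuda.can_device_access_peer(i, j)
    return access

matrix = p2p_matrix()
if not matrix:
    print("Fewer than two GPUs visible; nothing to check.")
else:
    for (i, j), ok in sorted(matrix.items()):
        print(f"GPU {i} -> GPU {j}: peer access {'supported' if ok else 'NOT supported'}")
```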

Thank you for your help. The GPUs I use are NVIDIA RTX 4090s, and I just learned that the 4090 does not support NVLink. That is most likely the cause of the error above.