Data distribution issue when training in parallel with DataParallel

I am trying to train my CNN on multiple GPUs using nn.DataParallel. However, I've encountered the following issue: when the input data [a1, a2, a3, a4] is distributed to two GPUs (gpu:0 and gpu:1), I expected the GPUs to receive [a1, a2] and [a3, a4], respectively. In reality, I'm getting [a1, a2] on gpu:0 and zeros (or data from an unknown source) on gpu:1.
Original input:
tensor([[1., 0., 0., 1.],
        [0., 0., 1., 1.],
        [0., 1., 0., 1.],
        [0., 1., 1., 0.]], device='cuda:0', dtype=torch.float64)
I used a forward hook to obtain and print the data that flows into each parallel GPU:
tensor([[1., 0., 0., 1.],
        [0., 0., 1., 1.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], device='cuda:0', dtype=torch.float64)
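For context, this is roughly the hook pattern I mean (a minimal, CPU-runnable sketch; the model and layer are illustrative placeholders, not my actual network). With nn.DataParallel on two GPUs, the hook would fire once per replica and each call would see that replica's shard of the batch; on a single device it simply sees the whole batch:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the real model.
model = nn.Linear(4, 2).double()

captured = []

def capture_input(module, inputs, output):
    # inputs is the tuple of tensors passed to forward();
    # under DataParallel each replica's hook receives its own shard.
    captured.append(inputs[0].detach().clone())

handle = model.register_forward_hook(capture_input)

x = torch.tensor([[1., 0., 0., 1.],
                  [0., 0., 1., 1.],
                  [0., 1., 0., 1.],
                  [0., 1., 1., 0.]], dtype=torch.float64)

# On two GPUs you would wrap: model = nn.DataParallel(model).cuda()
model(x)
handle.remove()
print(captured[0])
```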
Can someone tell me why this happens?
Postscript: the training data for the network is generated on the fly from the latest model, so there isn't a pre-existing dataset.

Another question: how can I obtain the output of an intermediate layer when the model is trained on multiple GPUs?
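One common approach (sketched below under illustrative names; the model and layer index are placeholders) is again a forward hook, but keyed by the output's device, since under DataParallel the hook fires once per replica and each replica's output lives on a different GPU:

```python
import torch
import torch.nn as nn

# Illustrative model; hook the layer whose output you want (index 0 here).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).double()

# device string -> list of captured intermediate outputs
intermediate = {}

def save_output(module, inputs, output):
    # Move to CPU so results from all replicas can be inspected together.
    intermediate.setdefault(str(output.device), []).append(output.detach().cpu())

handle = model[0].register_forward_hook(save_output)

x = torch.randn(4, 4, dtype=torch.float64)
# On multiple GPUs you would wrap: model = nn.DataParallel(model).cuda()
model(x)
handle.remove()

for device, outs in intermediate.items():
    print(device, [tuple(t.shape) for t in outs])
```

On a single CPU device this records one (4, 8) tensor under the key 'cpu'; on two GPUs you would instead get entries for 'cuda:0' and 'cuda:1', each holding that replica's half of the batch.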

Based on the issue, your system might have trouble with the peer-to-peer (P2P) connectivity between your GPUs.
You could check the NCCL FAQ and see whether IOMMU might need to be disabled.
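Before digging into NCCL settings, you could also query PyTorch directly for P2P capability between the visible GPUs (a small sketch; it needs at least two visible GPUs to report anything, and prints a notice otherwise):

```python
import torch

def p2p_matrix():
    """Return a dict mapping (src, dst) GPU index pairs to peer-access support."""
    n = torch.cuda.device_count()
    access = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                access[(i, j)] = torch.cuda.can_device_access_peer(i, j)
    return access

matrix = p2p_matrix()
if not matrix:
    print("Fewer than two GPUs visible; nothing to check.")
else:
    for (i, j), ok in sorted(matrix.items()):
        print(f"GPU {i} -> GPU {j}: peer access {'supported' if ok else 'NOT supported'}")
```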

Thank you for your help. The GPUs I use are NVIDIA RTX 4090s, and I just learned that the 4090 does not support NVLink. That is most likely the cause of the error above.