Strange behavior of nn.DataParallel

Hi all! I have four 1080 Ti GPUs, and when I train an inception_v3 net on multiple GPUs the model behaves strangely. I barely changed my single-GPU training code; I only added:
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()
When I run the script with device_ids=[0, 1], the GPUs are fully utilized and training is much faster. When I run it with device_ids=[0, 1, 2] or device_ids=[0, 1, 2, 3], the script starts (the GPUs show full utilization in nvidia-smi, but the reserved memory per card is small: 1 GB on the first card and 500 MB on the others), yet the model does not train. Where am I wrong? Also, my processor has only 2 cores. Thanks for any response, and sorry for my English.
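
For context, a minimal sketch of the kind of setup described above; the dummy data, optimizer, and hyperparameters are placeholders, not the original training code:

import torch
import torch.nn as nn
import torchvision

device = torch.device('cuda:0')

# aux_logits=False keeps the forward output a single tensor for this sketch
model = torchvision.models.inception_v3(aux_logits=False)
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy data in place of the real dataset, only to make the sketch self-contained
dataset = torch.utils.data.TensorDataset(
    torch.randn(32, 3, 299, 299), torch.randint(0, 1000, (32,)))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=8)

model.train()
for inputs, targets in train_loader:
    inputs = inputs.to(device)    # batches go to the first device in device_ids
    targets = targets.to(device)

    optimizer.zero_grad()
    outputs = model(inputs)       # DataParallel scatters the batch and gathers the outputs
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()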

Does your code just hang when using all 4 GPUs, or does “model doesn’t train” mean that the training loss is worse than on a single device?

With 1 or 2 GPUs the model trains, the loss decreases, all good. With 3 or 4 GPUs the script runs, but training doesn't work: I log the loss and accuracy every epoch, and after running the script overnight nothing was logged.

Could you try to run the code only on devices 2 and 3, since 0 and 1 are working?
Set the device via .to('cuda:2') or .to('cuda:3').
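
For illustration, running on one specific GPU means keeping both the model and every tensor fed to it on that device; a minimal sketch with a placeholder model:

import torch
import torch.nn as nn

device = torch.device('cuda:2')          # or 'cuda:3'

model = nn.Linear(10, 2).to(device)      # placeholder model, not the original inception_v3
inputs = torch.randn(4, 10).to(device)   # inputs must live on the same device as the model
outputs = model(inputs)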

If I set torch.device(“cuda:2”) or torch.device(“cuda:3”), I get the error: tensors must be on the same device. If I set nn.DataParallel(model, device_ids=[1,2,3]).cuda(), the free memory on the first GPU (index 0) decreases (the same as when I train on it), and after that the same error is raised: tensors must be on the same device. In the training code the batch of images is sent to the GPU via input.to(device). Maybe this happens because the processor has only 2 cores?
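
One likely reason the device_ids=[1,2,3] attempt fails: .cuda() without an index places the parameters on GPU 0, while nn.DataParallel expects them (and, by default, gathers the outputs) on device_ids[0]. A hedged sketch of how that combination is usually written, with a placeholder model:

import torch
import torch.nn as nn

device = torch.device('cuda:1')                  # must match device_ids[0]

model = nn.Linear(10, 2)                         # placeholder model
model = nn.DataParallel(model, device_ids=[1, 2, 3]).to(device)

inputs = torch.randn(6, 10).to(device)           # send batches to device_ids[0] as well
outputs = model(inputs)                          # replicated on GPUs 1-3, gathered on GPU 1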

Could you use device ids 0 and 1 in your script for nn.DataParallel and launch the script via:

CUDA_VISIBLE_DEVICES=2,3 python script.py args
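
Under that launch command only the physical GPUs 2 and 3 are visible to the process, and they are renumbered as cuda:0 and cuda:1 inside the script, so the nn.DataParallel call can keep device_ids=[0, 1] unchanged. A small sketch assuming that launch, with a placeholder model:

# launched as: CUDA_VISIBLE_DEVICES=2,3 python script.py args
import torch
import torch.nn as nn

print(torch.cuda.device_count())   # prints 2: only the masked-in GPUs are visible
model = nn.DataParallel(nn.Linear(10, 2), device_ids=[0, 1]).cuda()  # 0/1 map to physical GPUs 2/3
outputs = model(torch.randn(4, 10).cuda())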

Running the script with these parameters launches it, but the used GPU memory on devices 2 and 3 is only 1 GB and 500 MB, and the model doesn't train.

Thanks for the test.
Could you run the code on a single device now and check if it's working on GPU2 and GPU3?

Yes, if I set CUDA_VISIBLE_DEVICES=2 or CUDA_VISIBLE_DEVICES=3 in the terminal, training runs fine.

So you only see the hang when you are using nn.DataParallel with devices 2 and 3?
Could you run the p2pBandwidthLatencyTest from the CUDA samples?
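
As a quick companion check from inside PyTorch (not a replacement for the CUDA sample), peer-to-peer access between GPU pairs can be queried like this; a sketch assuming all four GPUs are visible:

import torch

num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f'GPU {i} -> GPU {j}: P2P {"possible" if ok else "not possible"}')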

p2pBandwidthLatencyTest has been running for 24 hours and hasn't finished. This is the output so far:
P2P Connectivity Matrix
   D\D     0      1      2      3
     0     1      1      1      1
     1     1      1      1      1
     2     1      1      1      1
     3     1      1      1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0  352.55   0.41   0.41   0.41
     1    0.39 216.53   0.39   0.39
     2    0.39   0.39 349.40   0.39
     3    0.39   0.39   0.39 350.65
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0  353.51   0.41   0.41   0.41
     1    0.35 376.32   0.00   0.00
     2    0.21   0.00 376.32   0.00
     3    0.21   0.00   0.00 377.78
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0  374.52   0.68   0.68   0.69
     1    0.69 375.24   0.67   0.67
     2    0.69   0.68 374.52   0.68
     3    0.69   0.68   0.68 374.70
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0  377.23   0.68   0.68   0.68
     1    0.68 372.56   0.55   0.55
     2    0.67   0.55 370.96   0.55
     3    0.67   0.55   0.55 370.26
P2P=Disabled Latency Matrix (us)

Thanks for the information. This points towards some communication issues between the GPUs.
Could you run the PyTorch code using NCCL_P2P_DISABLE=1 to use shared memory instead of p2p access?
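
For completeness, a sketch of how that flag could be applied; it can be passed on the command line, like the CUDA_VISIBLE_DEVICES example above (NCCL_P2P_DISABLE=1 python script.py args), or set from Python as long as that happens before the first GPU communication:

import os
os.environ['NCCL_P2P_DISABLE'] = '1'   # fall back to shared memory instead of P2P transfers

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 2), device_ids=[0, 1, 2, 3]).cuda()  # placeholder model
outputs = model(torch.randn(8, 10).cuda())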

Thanks! This option works for me.