I tried parallelizing my training across multiple GPUs using DataParallel on two GTX 1080s.
The training hangs right after it starts, and I cannot even kill the Docker container it is running in.
The same code runs without any problems on a single GPU.
I already tried the solutions described here and here; neither helped.
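For context, this is roughly how I wrap the model; the model, shapes, and batch below are placeholders for illustration, not my actual code:

```python
import torch
import torch.nn as nn

# Placeholder model -- my real model is larger, but the wrapping is the same.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Wrap the model so each batch is split across both GPUs (device_ids 0 and 1).
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy batch just to show the training step that hangs for me.
inputs = torch.randn(64, 128).cuda()
targets = torch.randint(0, 10, (64,)).cuda()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)  # hangs here when both GPUs are used
loss.backward()
optimizer.step()
```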
When I run p2pBandwidthLatencyTest, I get the following output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 44.99 4.63
1 4.73 49.25
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 176.67 0.46
1 0.66 255.81
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 257.41 5.53
1 5.40 256.42
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 255.38 1.12
1 1.28 254.56
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.20 10.46
1 10.36 1.24
CPU 0 1
0 7.17 13.92
1 13.99 7.51
P2P=Enabled Latency (P2P Writes) Matrix (us)
(the test hangs at this point and never prints the P2P-enabled latency matrix)
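For what it's worth, the same peer-to-peer capability can also be queried from PyTorch directly; this is just a sanity check one could run, not output from my setup:

```python
import torch

# Cross-check the P2P connectivity matrix reported by the CUDA sample.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'possible' if ok else 'not possible'}")
```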