DDP training on RTX 4090 (ADA, cu118)

Hi,
DDP training hangs with 100% CPU usage and no progress when using multiple RTX 4090s. Torch gets stuck at:

  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

NOTE: I'm using the nvcr.io/nvidia/pytorch:22.11-py3 container, which comes with torch==1.13.0a0+936e930
NOTE: training on a single GPU works fine
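For reference, the launch path is the usual spawn-based DDP setup; a minimal sketch that goes through the same code path (toy model, not my actual training script) looks like this:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    loss = model(torch.randn(8, 10, device=f"cuda:{rank}")).sum()
    loss.backward()  # the gradient all-reduce over NCCL happens here
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # the parent process blocks in context.join(), which is where the trace above points
    mp.spawn(worker, args=(world_size,), nprocs=world_size)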

Has anyone found a workaround?
Best

What Python version are you using?

nvcr.io/nvidia/pytorch:22.11-py3 comes with Python 3.8.10. I see the same behavior with Python 3.10.9.

After further investigation, the problem turned out to be the NCCL backend trying to use peer-to-peer (P2P) transport.
Forcing NCCL_P2P_DISABLE=1 fixed the issue :+1:
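For anyone else hitting this: the workaround is just an environment variable that has to be set before the processes launch (train.py below is a placeholder for your own entry point):

export NCCL_P2P_DISABLE=1
python train.py        # or: torchrun --nproc_per_node=2 train.py

Setting os.environ["NCCL_P2P_DISABLE"] = "1" at the top of the script, before torch.distributed initializes, should work as well, since the spawned workers inherit the environment.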


Thanks, this solves my problem with the RTX 4090.

export NCCL_P2P_DISABLE=1 sort of works for models like the DDP tutorial code in The-AI-Summer/pytorch-ddp on GitHub (DDP works, slowly; DP gives NaN loss).
Yet for the life of me I cannot get it working on my own models (roughly a CLIP transformer). If NCCL is enabled, it hangs with 100% volatile GPU utilization, though the processes can be killed with ^C or kill -9. If NCCL is disabled, it hard-freezes the system.

This was working perfectly well a few days ago on two 2080 Tis with otherwise identical hardware. The model trains fine on either of the 4090s individually. IOMMU is disabled in the BIOS. memtest is clean and gpu_burn reports no errors either; the hardware seems fine.

These GPUs need nvidia-driver >= 520 (I'm using 525.78.01), which comes with CUDA 12.0. (Related issue: torch.compile also doesn't work because they need sm_89, etc.) I might just train on one GPU until the new hardware bugs get ironed out …
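For what it's worth, NCCL's own logging (standard NCCL environment variables) at least shows which transport it tries to set up before the stall; train.py is again a placeholder:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,P2P    # optional: limit the log to init and transport selection
torchrun --nproc_per_node=2 train.py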

I have spent more than 80 hours debugging where the problem is.

I have one system with 4x 4090 on a WRX-80SE, and a second with 7x 4090; both behave the same.
The cards do not cooperate without NCCL_P2P_DISABLE=1.

But when you set NCCL_P2P_DISABLE=1, the transfer between cards is ~2 GB/s, which is EXTREMELY slow. 8x A100 systems have ~600 GB/s of P2P communication.

NVIDIA probably blocked the P2P connection between 4090s, which makes the cards essentially unusable for training AI models that do not fit in 24 GB.

I would be super grateful if someone has a solution, or another configuration on EPYC or Intel that works.

Ran into the same problem on a dual 4090 system. I tried the latest drivers from NVIDIA and the 525 driver available from stock Ubuntu. Disabling P2P makes it “work”, but then performance is 1 it/s compared to 4 it/s on my otherwise equivalent dual 3090 system.

The CUDA P2P sample (Samples/5_Domain_Specific/p2pBandwidthLatencyTest from the NVIDIA/cuda-samples repository on GitHub) runs without any errors or stalls, but reports atrocious bandwidth between GPUs.

I tried disabling ACS in my BIOS as suggested by the NCCL 2.16.2 Troubleshooting documentation, to no avail. My dual 3090 system has ACS enabled and doesn't have any issues, so I doubt that's related anyway.
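For reference, the ACS state can also be checked from the running system rather than the BIOS; this is the standard check from the NCCL troubleshooting guide:

sudo lspci -vvv | grep -i acsctl
# Lines containing SrcValid+ suggest ACS is active on that PCI bridge;
# all flags shown with '-' means ACS is effectively off.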

2x Gigabyte 4090
ASUS Prime X570-Pro (Latest BIOS)
Ryzen 5900X
Ubuntu 20.04.1

Could you post the p2p sample outputs and run additional NCCL tests from the nccl-tests repository, please?
Disabling p2p should not be necessary and I would like to try rebuilding a test system close to your setup to debug the issue.
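For completeness, building and running the tests should roughly be (assuming CUDA and NCCL are installed in their default locations):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make                                            # pass NCCL_HOME=/path/to/nccl if NCCL lives elsewhere
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2  # -g 2 runs the all-reduce across both GPUs in one process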

CC @greg_warzecha

Dear @ptrblck, of course! We can schedule a video call if you want.

Right now the system freezes during the NCCL test when NCCL_P2P_DISABLE=1 is not set.
NCCL freezes the whole system after printing the headers; usually the system becomes completely unresponsive.

Output with NCCL_P2P_DISABLE=1

ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   7064 on         g1 device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid   7064 on         g1 device  1 [0x61] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       8             2     float     sum      -1     6.72    0.00    0.00      0     6.64    0.00    0.00      0
      16             4     float     sum      -1     6.61    0.00    0.00      0     6.78    0.00    0.00      0
      32             8     float     sum      -1     6.59    0.00    0.00      0     6.76    0.00    0.00      0
      64            16     float     sum      -1     6.90    0.01    0.01      0     6.83    0.01    0.01      0
     128            32     float     sum      -1     6.92    0.02    0.02      0     6.68    0.02    0.02      0
     256            64     float     sum      -1     7.03    0.04    0.04      0     6.88    0.04    0.04      0
     512           128     float     sum      -1     7.00    0.07    0.07      0     6.98    0.07    0.07      0
    1024           256     float     sum      -1     7.29    0.14    0.14      0     7.17    0.14    0.14      0
    2048           512     float     sum      -1     7.40    0.28    0.28      0     7.14    0.29    0.29      0
    4096          1024     float     sum      -1     7.69    0.53    0.53      0     7.63    0.54    0.54      0
    8192          2048     float     sum      -1     8.54    0.96    0.96      0     8.42    0.97    0.97      0
   16384          4096     float     sum      -1    10.31    1.59    1.59      0    10.30    1.59    1.59      0
   32768          8192     float     sum      -1    14.18    2.31    2.31      0    14.09    2.33    2.33      0
   65536         16384     float     sum      -1    21.40    3.06    3.06      0    21.35    3.07    3.07      0
  131072         32768     float     sum      -1    31.69    4.14    4.14      0    31.62    4.15    4.15      0
  262144         65536     float     sum      -1    48.13    5.45    5.45      0    47.92    5.47    5.47      0
  524288        131072     float     sum      -1    77.00    6.81    6.81      0    75.53    6.94    6.94      0
 1048576        262144     float     sum      -1    137.5    7.63    7.63      0    136.8    7.67    7.67      0
 2097152        524288     float     sum      -1    262.2    8.00    8.00      0    262.4    7.99    7.99      0
 4194304       1048576     float     sum      -1    513.7    8.16    8.16      0    514.8    8.15    8.15      0
 8388608       2097152     float     sum      -1   1034.2    8.11    8.11      0   1034.2    8.11    8.11      0
16777216       4194304     float     sum      -1   2078.5    8.07    8.07      0   2080.2    8.07    8.07      0
33554432       8388608     float     sum      -1   4175.0    8.04    8.04      0   4164.4    8.06    8.06      0
67108864      16777216     float     sum      -1   8332.2    8.05    8.05      0   8330.4    8.06    8.06      0
134217728     33554432     float     sum      -1    16678    8.05    8.05      0    16677    8.05    8.05      0
268435456     67108864     float     sum      -1    33378    8.04    8.04      0    33380    8.04    8.04      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.75734

Example output WITHOUT NCCL_P2P_DISABLE=1:

ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   7064 on         g1 device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid   7064 on         g1 device  1 [0x61] NVIDIA GeForce RTX 4090
#

FREEZE

I checked almost everything:

  • WRX80SE motherboard settings,
  • PCIe Gen 4 vs Gen 3,
  • ACS, IOMMU, etc.

If you want, we would love to schedule a call and stream everything.
Our company computes human IgG antibodies using AI, and A100s or H100s are too expensive for us.

Not sure if you meant me or an earlier poster, but since I’m here:

Running inside sudo docker run --rm --gpus all -it pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

P2P sample:

root@5552c3798cde:/cuda-samples/bin/x86_64/linux/release# ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 910.55   6.63 
     1   6.13 922.37 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 913.21   6.78 
     1  17.79 922.92 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 916.96   8.83 
     1   7.87 924.01 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 917.77  13.55 
     1  13.55 924.26 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.38  18.07 
     1  18.73   1.41 

   CPU     0      1 
     0   1.84   5.88 
     1   5.69   1.75 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.39   1.60 
     1   1.02   1.40 

   CPU     0      1 
     0   1.82   1.46 
     1   1.49   1.78 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

NCCL Tests
Wasn’t sure which of the nccl tests to run or with what options, so I tried the example.

root@89884fa4021a:/nccl-tests# ./build/all_reduce_perf 
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    946 on 89884fa4021a device  0 [0x04] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum      -1    18.55  1809.10    0.00      0     0.25  136123.46    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

root@89884fa4021a:/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    955 on 89884fa4021a device  0 [0x04] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid    955 on 89884fa4021a device  1 [0x09] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

The first command ran on only one GPU. The second, as specified by the README, ran on both GPUs and … froze. I waited several minutes and didn’t see any progress, much like my DDP workload. Responds to Ctrl+C and immediately quits though.

nvidia-smi

root@89884fa4021a:/nccl-tests# nvidia-smi
Tue Jan 24 20:38:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  Off |
|  0%   44C    P8    26W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:09:00.0 Off |                  Off |
|  0%   47C    P8    29W / 450W |     73MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Seems to be an AMD-specific issue with NCCL P2P functionality on multiple 4090s. Unsure who will resolve it, or when.

Here is another thread - awaiting some progress.