DDP training on RTX 4090 (ADA, cu118)

Hi,
DDP training hangs with 100% CPU usage and no progress when using multiple RTX 4090s. Torch gets stuck at:

  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

NOTE: I’m using the nvcr.io/nvidia/pytorch:22.11-py3 container, which comes with torch==1.13.0a0+936e930
NOTE: training on a single GPU works fine

Has anyone found a workaround?
Best

What Python version are you using?

nvcr.io/nvidia/pytorch:22.11-py3 comes with Python 3.8.10. I see the same behavior with Python 3.10.9.

After further investigation, the problem turned out to be the NCCL backend trying to use peer-to-peer (P2P) transport.
Forcing NCCL_P2P_DISABLE=1 fixed the issue :+1:
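
In case it helps anyone, here is a minimal sketch of how the workaround can be applied with a spawn-based DDP launch (the environment variable is the fix from this thread; the script layout, address and port are only illustrative assumptions). Exporting NCCL_P2P_DISABLE=1 in the shell before launching is the more reliable route, since the variable must be visible before NCCL initializes:

import os
# Workaround from this thread: must be set before NCCL initializes.
# Exporting it in the shell (export NCCL_P2P_DISABLE=1) before launch is safer.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Spawned workers inherit the environment set above.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # illustrative address/port
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)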


Thanks, this solves my problem with the RTX 4090.

export NCCL_P2P_DISABLE=1 sort of works for models like GitHub - The-AI-Summer/pytorch-ddp: code for the ddp tutorial. (DDP works, slowly; DP gives a NaN loss.)
Yet for the life of me I cannot get it working on my own models (roughly a CLIP transformer). If NCCL is enabled, it hangs with 100% volatile GPU utilization, but the processes can be killed with ^C or kill -9. If NCCL is disabled, it hard-freezes the system.

This was working perfectly well a few days ago on two 2080 Tis with otherwise identical hardware. The model trains fine on either of the 4090s individually. IOMMU is disabled in the BIOS; memtest is clean and gpu_burn reports no errors either, so the hardware seems fine.

These GPUs need nvidia-driver >= 520 (I’m using 525.78.01), which comes with CUDA 12.0. (Related issue: torch.compile also doesn’t work because these cards need sm_89 support, etc.) I might just train on one GPU until the new hardware bugs get ironed out …

I have spent more than 80 hours debugging this problem.

I have a 4x 4090 system on a WRX80SE, and a second one with 7x 4090; both behave the same.
The cards do not cooperate without NCCL_P2P_DISABLE=1.

But when you set NCCL_P2P_DISABLE=1, the transfer rate between cards is ~2 GB/s, which is EXTREMELY slow. 8x A100 have ~600 GB/s P2P communication.
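
(For anyone who wants to reproduce this kind of number from PyTorch itself, here is a minimal sketch of a raw GPU-to-GPU copy benchmark, assuming two visible GPUs. It measures the plain CUDA copy path rather than NCCL, so it is only a rough proxy for the NCCL bandwidth discussed here.)

import time
import torch

# Rough device-to-device copy bandwidth check (plain CUDA copy, not NCCL).
nbytes = 256 * 1024 * 1024  # 256 MB payload
src = torch.empty(nbytes, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(nbytes, dtype=torch.uint8, device="cuda:1")

for _ in range(5):  # warm-up
    dst.copy_(src)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dst.copy_(src)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - start

print(f"~{nbytes * iters / elapsed / 1e9:.1f} GB/s GPU0 -> GPU1")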

Probably NVIDIA has blocked the P2P connection between 4090s, which makes these cards practically unusable for training AI models that do not fit in 24 GB.

I would be super grateful if someone has a solution, or another configuration on EPYC or Intel that works.

Ran into the same problem on a dual 4090 system. Tried the latest drivers from NVIDIA and the 525 available from stock Ubuntu. Disabling P2P makes it “work”, but then performance is 1 it/s compared to 4 it/s on my otherwise equivalent dual 3090 system.

The CUDA p2p sample runs without any errors or stalls (cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest at master · NVIDIA/cuda-samples · GitHub), but reports atrocious bandwidth between GPUs.

I tried disabling ACS in my BIOS as suggested by (Troubleshooting — NCCL 2.16.2 documentation), to no avail. My dual 3090 system has ACS enabled and it doesn’t have any issues, so I doubt that’s related anyway.

2x Gigabyte 4090
ASUS Prime X570-Pro (Latest BIOS)
Ryzen 5900X
Ubuntu 20.04.1

Could you post the p2p sample outputs and run additional NCCL tests from this repository, please?
Disabling p2p should not be necessary and I would like to try rebuilding a test system close to your setup to debug the issue.

CC @greg_warzecha

Dear @ptrblck, of course! We can schedule a video call if you want.

Right now the system freezes during the NCCL test when NCCL_P2P_DISABLE=1 is not set.
NCCL freezes the whole system after printing the headers; usually the system becomes completely unresponsive.

Output with NCCL_P2P_DISABLE=1

ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090

Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090

                                                              out-of-place                       in-place
       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

       8             2     float     sum      -1     6.72    0.00    0.00      0     6.64    0.00    0.00      0
      16             4     float     sum      -1     6.61    0.00    0.00      0     6.78    0.00    0.00      0
      32             8     float     sum      -1     6.59    0.00    0.00      0     6.76    0.00    0.00      0
      64            16     float     sum      -1     6.90    0.01    0.01      0     6.83    0.01    0.01      0
     128            32     float     sum      -1     6.92    0.02    0.02      0     6.68    0.02    0.02      0
     256            64     float     sum      -1     7.03    0.04    0.04      0     6.88    0.04    0.04      0
     512           128     float     sum      -1     7.00    0.07    0.07      0     6.98    0.07    0.07      0
    1024           256     float     sum      -1     7.29    0.14    0.14      0     7.17    0.14    0.14      0
    2048           512     float     sum      -1     7.40    0.28    0.28      0     7.14    0.29    0.29      0
    4096          1024     float     sum      -1     7.69    0.53    0.53      0     7.63    0.54    0.54      0
    8192          2048     float     sum      -1     8.54    0.96    0.96      0     8.42    0.97    0.97      0
   16384          4096     float     sum      -1    10.31    1.59    1.59      0    10.30    1.59    1.59      0
   32768          8192     float     sum      -1    14.18    2.31    2.31      0    14.09    2.33    2.33      0
   65536         16384     float     sum      -1    21.40    3.06    3.06      0    21.35    3.07    3.07      0
  131072         32768     float     sum      -1    31.69    4.14    4.14      0    31.62    4.15    4.15      0
  262144         65536     float     sum      -1    48.13    5.45    5.45      0    47.92    5.47    5.47      0
  524288        131072     float     sum      -1    77.00    6.81    6.81      0    75.53    6.94    6.94      0
 1048576        262144     float     sum      -1    137.5    7.63    7.63      0    136.8    7.67    7.67      0
 2097152        524288     float     sum      -1    262.2    8.00    8.00      0    262.4    7.99    7.99      0
 4194304       1048576     float     sum      -1    513.7    8.16    8.16      0    514.8    8.15    8.15      0
 8388608       2097152     float     sum      -1   1034.2    8.11    8.11      0   1034.2    8.11    8.11      0
16777216       4194304     float     sum      -1   2078.5    8.07    8.07      0   2080.2    8.07    8.07      0
33554432       8388608     float     sum      -1   4175.0    8.04    8.04      0   4164.4    8.06    8.06      0
67108864      16777216     float     sum      -1   8332.2    8.05    8.05      0   8330.4    8.06    8.06      0

134217728      33554432     float     sum      -1    16678    8.05    8.05      0    16677    8.05    8.05      0
268435456      67108864     float     sum      -1    33378    8.04    8.04      0    33380    8.04    8.04      0

Out of bounds values : 0 OK

Avg bus bandwidth : 3.75734

Example output WITHOUT NCCL_P2P_DISABLE=1:

ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090

Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090

FREEZE

I checked almost everything:

  • WRX80SE motherboard settings,
  • PCI gen4 or 3,
  • ACS, IOMMU, etc.

If you want, we would love to schedule a call and stream everything.
Our company computes human IgG antibodies using AI, and A100s or H100s are too expensive for us.

Not sure if you meant me or an earlier poster, but since I’m here:

Running inside sudo docker run --rm --gpus all -it pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

P2P sample:

root@5552c3798cde:/cuda-samples/bin/x86_64/linux/release# ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 910.55   6.63 
     1   6.13 922.37 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 913.21   6.78 
     1  17.79 922.92 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 916.96   8.83 
     1   7.87 924.01 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 917.77  13.55 
     1  13.55 924.26 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.38  18.07 
     1  18.73   1.41 

   CPU     0      1 
     0   1.84   5.88 
     1   5.69   1.75 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.39   1.60 
     1   1.02   1.40 

   CPU     0      1 
     0   1.82   1.46 
     1   1.49   1.78 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

NCCL Tests
Wasn’t sure which of the nccl tests to run or with what options, so I tried the example.

root@89884fa4021a:/nccl-tests# ./build/all_reduce_perf 
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    946 on 89884fa4021a device  0 [0x04] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum      -1    18.55  1809.10    0.00      0     0.25  136123.46    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

root@89884fa4021a:/nccl-tests# 
root@89884fa4021a:/nccl-tests# 
root@89884fa4021a:/nccl-tests# 
root@89884fa4021a:/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    955 on 89884fa4021a device  0 [0x04] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid    955 on 89884fa4021a device  1 [0x09] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

The first command ran on only one GPU. The second, as specified by the README, ran on both GPUs and … froze. I waited several minutes and didn’t see any progress, much like my DDP workload. It responds to Ctrl+C and quits immediately, though.

nvidia-smi

root@89884fa4021a:/nccl-tests# nvidia-smi
Tue Jan 24 20:38:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  Off |
|  0%   44C    P8    26W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:09:00.0 Off |                  Off |
|  0%   47C    P8    29W / 450W |     73MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Seems to be an AMD-specific issue with NCCL P2P functionality on multiple 4090s. Unsure who will resolve it, or when.


Here is another thread - awaiting some progress.


I’m not sure if it is fully related to the issue, but on my 2x4090 I get the following:

import torch
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))
print(v.to('cpu').to('cuda:1'))

tensor([ 1.5336, 0.8161, -0.9325, -0.9513, 0.1360], device='cuda:0')
tensor([0., 0., 0., 0., 0.], device='cuda:1')
tensor([ 1.5336, 0.8161, -0.9325, -0.9513, 0.1360], device='cuda:1')

It looks like there is a bug (very likely on NVIDIA’s side) in the GPU-to-GPU memory copy: it sets everything to zeros. I have the latest NVIDIA driver and tried the latest stable PyTorch as well as the PyTorch 2.0 preview.
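
A quick programmatic sanity check along the same lines (a minimal sketch, assuming two visible GPUs; note that can_device_access_peer only reports whether the driver claims P2P support, it does not validate the copy itself):

import torch

# Does the driver report peer access between the two cards?
print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))

# Compare a direct device-to-device copy against a CPU-staged copy.
v = torch.randn(1024, device="cuda:0")
direct = v.to("cuda:1")            # the (possibly broken) direct copy path
staged = v.to("cpu").to("cuda:1")  # forces staging through host memory

if torch.equal(direct, staged):
    print("Direct GPU-to-GPU copy looks correct.")
else:
    print("Direct GPU-to-GPU copy is corrupted (all zeros on this setup).")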

And here is the NVIDIA simpleP2P test output:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 12.61GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!

Could you disable P2P via export NCCL_P2P_DISABLE=1 and check if this solves the issue? You might be running into a driver bug.

Thank you for checking. It doesn’t change the behavior. I also disabled IOMMU and SVM in the BIOS.

import os
os.environ["NCCL_P2P_DISABLE"] = "1"
import torch

print('Test 1')
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))
print(v.to('cpu').to('cuda:1'))

print('Test 2')
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))
print(v.to('cpu').to('cuda:1'))

Test 1
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:0')
tensor([0., 0., 0., 0., 0.], device='cuda:1')
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:1')
Test 2
tensor([-0.5404, -1.6951, -0.4220, -0.9484, 0.1218], device='cuda:0')
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:1')
tensor([-0.5404, -1.6951, -0.4220, -0.9484, 0.1218], device='cuda:1')

It is very likely an NVIDIA driver-related issue. I just finished building a 2x 4090 system, and in initial testing I realized that PyTorch is not working properly with multiple GPUs. Hopefully it will be fixed by NVIDIA soon.

Set the env variable in your terminal, as setting it in the Python script is tricky and fails if it’s set too late.
Based on this discussion, disabling P2P should help, and yes, it seems to be a driver issue.

Unfortunately, it doesn’t work for me even if I set it in the terminal.

Probably some additional issues are caused by the 670E chipset, beyond the ones the 4090 cards have. As I understand it, NCCL_P2P_DISABLE should make the driver behave identically to .to('cpu').to('cuda:1') for a device-to-device copy, because the data goes through shared memory, but somehow my test still gives zeros when I call .to('cuda:1').


Could you check if IO Virtualization is on and disable it as described here?

Thanks. After running "sudo lspci -vvv | grep ACSCtl", I get the following:

ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
pcilib: sysfs_read_vpd: read failed: No such device
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

This seems to indicate that ACS is disabled. Or am I missing something?