DDP training on RTX 4090 (ADA, cu118)

Dear @ptrblck, of course! We can schedule a video call if you want.

Now the system freezes during the NCCL test whenever NCCL_P2P_DISABLE=1 is not set.
NCCL freezes the whole system right after printing the headers, and the machine usually becomes completely unresponsive.
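
For the DDP training script itself, the same workaround can be applied from Python instead of the shell. This is only a minimal sketch: `apply_nccl_workaround` and `init_ddp` are hypothetical helper names, and it assumes the script is launched with `torchrun`, which provides the rank and rendezvous environment variables. The variables have to be set before the NCCL process group is created.

```python
import os


def apply_nccl_workaround(env=None):
    """Disable NCCL peer-to-peer transport in the given environment mapping.

    Hypothetical helper: defaults to os.environ, accepts a plain dict for testing.
    """
    env = os.environ if env is None else env
    env["NCCL_P2P_DISABLE"] = "1"          # same setting as in the nccl-tests run
    env.setdefault("NCCL_DEBUG", "INFO")   # extra NCCL logging, unless already set
    return env


def init_ddp():
    """Apply the workaround, then create the NCCL process group."""
    import torch.distributed as dist  # imported here so the helper above works without torch

    apply_nccl_workaround()
    # Assumes torchrun has set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
```

Calling `init_ddp()` at the top of the training entry point should be equivalent to exporting `NCCL_P2P_DISABLE=1` in the shell before launching.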

Output with NCCL_P2P_DISABLE=1

ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090

Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090

                                                          out-of-place                      in-place
    size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       8             2     float     sum      -1     6.72    0.00    0.00      0     6.64    0.00    0.00      0
      16             4     float     sum      -1     6.61    0.00    0.00      0     6.78    0.00    0.00      0
      32             8     float     sum      -1     6.59    0.00    0.00      0     6.76    0.00    0.00      0
      64            16     float     sum      -1     6.90    0.01    0.01      0     6.83    0.01    0.01      0
     128            32     float     sum      -1     6.92    0.02    0.02      0     6.68    0.02    0.02      0
     256            64     float     sum      -1     7.03    0.04    0.04      0     6.88    0.04    0.04      0
     512           128     float     sum      -1     7.00    0.07    0.07      0     6.98    0.07    0.07      0
    1024           256     float     sum      -1     7.29    0.14    0.14      0     7.17    0.14    0.14      0
    2048           512     float     sum      -1     7.40    0.28    0.28      0     7.14    0.29    0.29      0
    4096          1024     float     sum      -1     7.69    0.53    0.53      0     7.63    0.54    0.54      0
    8192          2048     float     sum      -1     8.54    0.96    0.96      0     8.42    0.97    0.97      0
   16384          4096     float     sum      -1    10.31    1.59    1.59      0    10.30    1.59    1.59      0
   32768          8192     float     sum      -1    14.18    2.31    2.31      0    14.09    2.33    2.33      0
   65536         16384     float     sum      -1    21.40    3.06    3.06      0    21.35    3.07    3.07      0
  131072         32768     float     sum      -1    31.69    4.14    4.14      0    31.62    4.15    4.15      0
  262144         65536     float     sum      -1    48.13    5.45    5.45      0    47.92    5.47    5.47      0
  524288        131072     float     sum      -1    77.00    6.81    6.81      0    75.53    6.94    6.94      0
 1048576        262144     float     sum      -1    137.5    7.63    7.63      0    136.8    7.67    7.67      0
 2097152        524288     float     sum      -1    262.2    8.00    8.00      0    262.4    7.99    7.99      0
 4194304       1048576     float     sum      -1    513.7    8.16    8.16      0    514.8    8.15    8.15      0
 8388608       2097152     float     sum      -1   1034.2    8.11    8.11      0   1034.2    8.11    8.11      0
16777216       4194304     float     sum      -1   2078.5    8.07    8.07      0   2080.2    8.07    8.07      0
33554432       8388608     float     sum      -1   4175.0    8.04    8.04      0   4164.4    8.06    8.06      0
67108864      16777216     float     sum      -1   8332.2    8.05    8.05      0   8330.4    8.06    8.06      0
134217728     33554432     float     sum      -1    16678    8.05    8.05      0    16677    8.05    8.05      0
268435456     67108864     float     sum      -1    33378    8.04    8.04      0    33380    8.04    8.04      0

Out of bounds values : 0 OK

Avg bus bandwidth : 3.75734
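
As a sanity check on the table above, the bandwidth columns can be recomputed from the size and time columns. The sketch below uses the definitions from the nccl-tests performance notes: algbw is bytes divided by time, and for all-reduce busbw is algbw scaled by 2(n-1)/n, which equals 1 for two GPUs (hence the identical algbw and busbw columns). The function names are my own, not part of nccl-tests.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes per microsecond divided by 1000."""
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# (size in bytes, out-of-place time in us) taken from two rows of the table above
rows = [(1048576, 137.5), (268435456, 33378.0)]

for size, time_us in rows:
    algbw = algbw_gbps(size, time_us)
    busbw = allreduce_busbw_gbps(algbw, n_ranks=2)
    print(f"{size:>10} B  algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")
```

The reported average bus bandwidth is much lower than the roughly 8 GB/s plateau, presumably because the small message sizes, which are latency bound, are included in the average.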

Example output without NCCL_P2P_DISABLE=1:

ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices

Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090

Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090

FREEZE

I have checked almost everything:

  • WRX80SE motherboard settings,
  • PCIe Gen4 or Gen3,
  • ACS, IOMMU, etc.
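
To make the IOMMU part of these checks reproducible, the kernel command line can be inspected from Python. This is a rough sketch with a hypothetical helper, `iommu_params`; it only reports which of the `iommu`, `amd_iommu` and `intel_iommu` boot parameters are present and does not say which setting is correct for this board. BIOS-level options such as ACS are not visible this way.

```python
from pathlib import Path

# Kernel boot parameters that influence IOMMU behaviour on Linux.
IOMMU_KEYS = ("iommu", "amd_iommu", "intel_iommu")


def iommu_params(cmdline: str) -> dict:
    """Return the IOMMU-related key=value pairs found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in IOMMU_KEYS:
            found[key] = value
    return found


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    text = path.read_text() if path.exists() else ""
    print(iommu_params(text) or "no IOMMU parameters on the kernel command line")
```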

If you want, we would love to schedule a call and stream everything.
Our company computes human IgG antibodies using AI, and A100 or H100 GPUs are too expensive for us.