Dear @ptrblck, of course! We can schedule a video call if you want.
The system now freezes during the NCCL test when NCCL_P2P_DISABLE=1 is not set.
NCCL freezes the whole system right after printing the headers; usually the system becomes completely unresponsive.
Output with NCCL_P2P_DISABLE=1:
ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 6.72 0.00 0.00 0 6.64 0.00 0.00 0
16 4 float sum -1 6.61 0.00 0.00 0 6.78 0.00 0.00 0
32 8 float sum -1 6.59 0.00 0.00 0 6.76 0.00 0.00 0
64 16 float sum -1 6.90 0.01 0.01 0 6.83 0.01 0.01 0
128 32 float sum -1 6.92 0.02 0.02 0 6.68 0.02 0.02 0
256 64 float sum -1 7.03 0.04 0.04 0 6.88 0.04 0.04 0
512 128 float sum -1 7.00 0.07 0.07 0 6.98 0.07 0.07 0
1024 256 float sum -1 7.29 0.14 0.14 0 7.17 0.14 0.14 0
2048 512 float sum -1 7.40 0.28 0.28 0 7.14 0.29 0.29 0
4096 1024 float sum -1 7.69 0.53 0.53 0 7.63 0.54 0.54 0
8192 2048 float sum -1 8.54 0.96 0.96 0 8.42 0.97 0.97 0
16384 4096 float sum -1 10.31 1.59 1.59 0 10.30 1.59 1.59 0
32768 8192 float sum -1 14.18 2.31 2.31 0 14.09 2.33 2.33 0
65536 16384 float sum -1 21.40 3.06 3.06 0 21.35 3.07 3.07 0
131072 32768 float sum -1 31.69 4.14 4.14 0 31.62 4.15 4.15 0
262144 65536 float sum -1 48.13 5.45 5.45 0 47.92 5.47 5.47 0
524288 131072 float sum -1 77.00 6.81 6.81 0 75.53 6.94 6.94 0
1048576 262144 float sum -1 137.5 7.63 7.63 0 136.8 7.67 7.67 0
2097152 524288 float sum -1 262.2 8.00 8.00 0 262.4 7.99 7.99 0
4194304 1048576 float sum -1 513.7 8.16 8.16 0 514.8 8.15 8.15 0
8388608 2097152 float sum -1 1034.2 8.11 8.11 0 1034.2 8.11 8.11 0
16777216 4194304 float sum -1 2078.5 8.07 8.07 0 2080.2 8.07 8.07 0
33554432 8388608 float sum -1 4175.0 8.04 8.04 0 4164.4 8.06 8.06 0
67108864 16777216 float sum -1 8332.2 8.05 8.05 0 8330.4 8.06 8.06 0
134217728 33554432 float sum -1 16678 8.05 8.05 0 16677 8.05 8.05 0
268435456 67108864 float sum -1 33378 8.04 8.04 0 33380 8.04 8.04 0
Out of bounds values : 0 OK
Avg bus bandwidth : 3.75734
Example output WITHOUT NCCL_P2P_DISABLE=1:
ubuntu@g1:~/nvidia/nccl-tests$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090
FREEZE
I have checked almost everything:
- WRX80SE motherboard settings,
- PCIe Gen4 or Gen3,
- ACS, IOMMU etc…
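For reference, the ACS part of the checks above can be sketched in Python. This is only a minimal sketch under assumptions: it relies on the NCCL troubleshooting advice that an `ACSCtl` line containing `SrcValid+` in `lspci -vvv` output may indicate ACS is enabled on that device. The function name `devices_with_acs_enabled` and the `SAMPLE` text are hypothetical and made up for illustration; they are not output from the machine described here.

```python
import re

# Start of a device block in `lspci -vvv` output, e.g. "00:01.1 PCI bridge: ..."
# (an optional 4-digit PCI domain prefix such as "0000:" is also accepted).
DEVICE_RE = re.compile(
    r"^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\s"
)


def devices_with_acs_enabled(lspci_text):
    """Return PCI addresses whose ACSCtl line reports SrcValid+.

    Assumption: SrcValid+ on the ACSCtl line is treated as "ACS may be enabled".
    """
    enabled = []
    current = None  # address of the device block currently being read
    for line in lspci_text.splitlines():
        if DEVICE_RE.match(line):
            current = line.split()[0]
        elif current is not None and "ACSCtl:" in line and "SrcValid+" in line:
            enabled.append(current)
    return enabled


# Hypothetical sample text, NOT taken from the system in this post.
SAMPLE = """\
00:01.1 PCI bridge: Example bridge A
\tCapabilities: [2a0 v1] Access Control Services
\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
00:03.1 PCI bridge: Example bridge B
\tCapabilities: [2a0 v1] Access Control Services
\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
"""

if __name__ == "__main__":
    print(devices_with_acs_enabled(SAMPLE))
```

To use it on a real machine, the text could come from saving `sudo lspci -vvv` to a file and passing the file contents to the function; any address it returns is a candidate for disabling ACS in the BIOS or on the bridge.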
If you want, we would love to schedule a call and stream everything.
Our company calculates human IgG antibodies using AI, and A100 or H100 GPUs are too expensive for us.