Hi,
DDP training hangs with 100% CPU and no progress when using multiple RTX 4090s. Torch gets stuck at:
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 109, in join
ready = multiprocessing.connection.wait(
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
NOTE: I'm using the nvcr.io/nvidia/pytorch:22.11-py3 container, which ships with torch==1.13.0a0+936e930
NOTE: training on a single GPU works fine
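For context, the failing pattern boils down to the usual mp.spawn + DDP training loop. Below is only a stripped-down sketch of that pattern; the linear layer, random batch, and port 29500 are placeholders, not my actual CLIP-style setup:

# Stripped-down sketch of the failing pattern, assuming 2 GPUs and a free port 29500;
# the toy model and random data stand in for the real CLIP-style transformer.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(512, 512).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):
        x = torch.randn(32, 512, device=f"cuda:{rank}")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()   # gradient all-reduce over NCCL happens here
        opt.step()
        if rank == 0:
            print(f"step {step} done", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)   # parent blocks in context.join(), as in the traceback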
export NCCL_P2P_DISABLE=1 sort of works for simple models like the ones in the The-AI-Summer/pytorch-ddp tutorial repo (DDP works, slowly; DP gives NaN loss).
Yet for the life of me I cannot get it working on my own models (roughly a CLIP-style transformer). With NCCL enabled it hangs at 100% volatile GPU utilization, though the processes can still be killed with ^C or kill -9; with NCCL disabled, the system hard-freezes.
This was working perfectly well a few days ago on two 2080 Tis in otherwise identical hardware. The model trains fine on either 4090 individually. IOMMU is disabled in the BIOS; memtest is clean and gpu_burn reports no errors either, so the hardware seems fine.
These GPUs need nvidia-driver >= 520 (I'm on 525.78.01), which comes with CUDA 12.0. (Related issue: torch.compile also doesn't work because they need sm_89, etc.) I might just train on one GPU until the new-hardware bugs get ironed out …
I have spent more than 80 hours debugging this problem.
I have one machine with 4x 4090 on a WRX80-SE and a second with 7x 4090; both behave the same.
The cards do not cooperate without NCCL_P2P_DISABLE=1.
But with NCCL_P2P_DISABLE=1 set, the card-to-card transfer rate is ~2 GB/s, which is EXTREMELY slow; 8x A100 get ~600 GB/s of P2P bandwidth.
NVIDIA has probably blocked the P2P connection between 4090s, which makes these cards useless for training AI models that do not fit in 24 GB.
I would be super grateful if someone has a solution, or another configuration on EPYC or Intel that works.
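To put a number on the card-to-card path, a rough torch-only measurement along these lines can be compared against what the CUDA p2pBandwidthLatencyTest sample reports. This is only a sketch; the 256 MiB buffer size and iteration count are arbitrary choices:

# Rough sketch: check peer access and time repeated GPU0 -> GPU1 copies.
# Buffer size (256 MiB) and iteration count are arbitrary.
import time
import torch

def d2d_bandwidth(src=0, dst=1, mib=256, iters=20):
    """Return an approximate device-to-device copy bandwidth in GiB/s."""
    n = mib * 1024 * 1024 // 4                       # number of float32 elements
    a = torch.empty(n, dtype=torch.float32, device=f"cuda:{src}")
    b = torch.empty(n, dtype=torch.float32, device=f"cuda:{dst}")
    b.copy_(a)                                       # warm-up copy
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    t0 = time.time()
    for _ in range(iters):
        b.copy_(a)                                   # uses P2P when the driver allows it, otherwise staged through host memory
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    return (mib / 1024) * iters / (time.time() - t0)

print("peer access 0->1:", torch.cuda.can_device_access_peer(0, 1))
print("peer access 1->0:", torch.cuda.can_device_access_peer(1, 0))
print(f"~{d2d_bandwidth():.1f} GiB/s GPU0 -> GPU1")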
Ran into the same problem on a dual-4090 system. I tried the latest drivers from NVIDIA and the 525 release available from stock Ubuntu. Disabling P2P makes it "work", but then performance is 1 it/s compared to 4 it/s on my otherwise equivalent dual-3090 system.
I tried disabling ACS in my BIOS as suggested by the NCCL docs (Troubleshooting — NCCL 2.16.2 documentation), to no avail. My dual-3090 system has ACS enabled and doesn't have any issues, so I doubt that's related anyway.
2x Gigabyte 4090
ASUS Prime X570-Pro (Latest BIOS)
Ryzen 5900X
Ubuntu 20.04.1
Could you post the p2p sample outputs and run additional NCCL tests from this repository, please?
Disabling p2p should not be necessary and I would like to try rebuilding a test system close to your setup to debug the issue.
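A bare-bones all_reduce from PyTorch can also help localize the hang (this is just a sketch, not something from the nccl-tests repo); with NCCL_DEBUG=INFO, NCCL logs which transports (P2P, SHM, NET) it picks before anything stalls. Two GPUs and port 29501 are assumed:

# Bare-bones NCCL all_reduce smoke test, assuming 2 GPUs; port 29501 is arbitrary.
# NCCL_DEBUG=INFO makes NCCL log its transport selection to stderr.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1 << 20, device=f"cuda:{rank}")   # ~4 MiB of float32 per rank
    dist.all_reduce(x)                               # the collective that presumably hangs
    torch.cuda.synchronize()
    print(f"rank {rank}: all_reduce ok, x[0] = {x[0].item()}", flush=True)
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["NCCL_DEBUG"] = "INFO"                # compare with NCCL_P2P_DISABLE=1 set as well
    mp.spawn(worker, args=(2,), nprocs=2, join=True)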
Dear @ptrblck, of course! We can schedule a video call if you want.
Now the system freezes during the NCCL test unless NCCL_P2P_DISABLE=1 is set.
NCCL freezes the whole system: after printing the headers, the machine is usually completely unresponsive:
Rank 0 Group 0 Pid 7064 on g1 device 0 [0x41] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 7064 on g1 device 1 [0x61] NVIDIA GeForce RTX 4090
FREEZE
I have checked almost everything:
WRX80-SE motherboard settings,
PCIe gen 4 vs. gen 3,
ACS, IOMMU, etc.
If you want, we would love to schedule a call and stream everything.
Our company models human IgG antibodies with AI, and A100s or H100s are too expensive for us.
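Since several setups are being compared in this thread, a short torch-side version summary like the sketch below (or the full report from python -m torch.utils.collect_env) is probably worth posting alongside the hardware lists:

# Quick summary of the software versions most relevant to this thread;
# python -m torch.utils.collect_env gives the full environment report.
import torch

print("torch:", torch.__version__)
print("cuda (build):", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"gpu {i}: {props.name}, sm_{props.major}{props.minor}, {props.total_memory // 2**20} MiB")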
The first command ran on only one GPU. The second, as specified by the README, ran on both GPUs and … froze. I waited several minutes and didn't see any progress, much like my DDP workload. It does respond to Ctrl+C and quits immediately, though.
nvidia-smi
root@89884fa4021a:/nccl-tests# nvidia-smi
Tue Jan 24 20:38:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  Off |
|  0%   44C    P8    26W / 450W |      6MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:09:00.0 Off |                  Off |
|  0%   47C    P8    29W / 450W |     73MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+