For demonstration, here are only 2 processes. You can see it in nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:61:00.0 Off | 0 |
| N/A 39C P0 57W / 300W | 3433MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:62:00.0 Off | 0 |
| N/A 36C P0 58W / 300W | 2615MiB / 16384MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 231620 C ...0-torch2.1/bin/python3.10 2612MiB |
| 0 N/A N/A 231621 C ...0-torch2.1/bin/python3.10 818MiB |
| 1 N/A N/A 231621 C ...0-torch2.1/bin/python3.10 2612MiB |
+-----------------------------------------------------------------------------+
You can see that process 231621, which runs on GPU 1, also reserved some memory (818 MiB) on GPU 0.
In my actual use case, I wanted to use 16 GPUs (a DGX with V100s). Every worker reserved some memory on GPU 0, which then caused an OOM, because not much memory was left anymore:

OutOfMemoryError: CUDA out of memory. Tried to allocate 52.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 39.62 MiB is free. Process 101553 has 1.19 GiB memory in use. Process 101552 has 1.19 GiB memory in use. Process 101563 has 1.19 GiB memory in use. Process 101549 has 1.19 GiB memory in use. Process 101555 has 1.19 GiB memory in use. Process 101558 has 1.19 GiB memory in use. Process 101557 has 1.19 GiB memory in use. Process 101562 has 1.19 GiB memory in use. Process 101556 has 1.19 GiB memory in use. Process 101560 has 1.19 GiB memory in use. Process 101554 has 1.19 GiB memory in use. Process 101550 has 1.19 GiB memory in use. Process 101561 has 1.19 GiB memory in use. Process 101559 has 1.19 GiB memory in use. Process 101551 has 1.19 GiB memory in use. Including non-PyTorch memory, this process has 13.82 GiB memory in use. Of the allocated memory 11.54 GiB is allocated by PyTorch, and 677.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
So, out of the 32 GB this GPU has (V100), 15 * 1.19 GB ≈ 17.85 GB is reserved by the other processes, leaving only about 14 GB, which is too little for my actual use case.
As you can see, this is obviously a problem. Is this expected behavior? Am I doing something wrong?
I actually traced back when this memory reservation happens: it happens in DistributedDataParallel.__init__, inside _verify_param_shape_across_processes. Before that call, there is no memory reserved on GPU 0 by rank 1, and after it, I see the memory in nvidia-smi.
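For context, a minimal sketch of roughly what each worker does (simplified: the Linear model is just a placeholder for my actual model, and the LOCAL_RANK handling assumes a torchrun-style launcher):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # assumes a torchrun-style launcher
dist.init_process_group(backend="nccl")

# Note: no torch.cuda.set_device(local_rank) anywhere here.
model = torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}")  # placeholder model

# The extra memory on GPU 0 shows up in nvidia-smi during this call,
# inside _verify_param_shape_across_processes.
ddp_model = DDP(model, device_ids=[local_rank])

dist.destroy_process_group()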
Your processes are all initializing a CUDA context on the default device. This can be avoided by setting the device via torch.cuda.set_device. In case this doesn't help and you get stuck, could you post a minimal and executable code snippet reproducing the issue?
Oh, adding torch.cuda.set_device really fixed the problem! But this looks like a bug, right? The model is on the right device, and then inside the DDP wrapper this happens; more specifically, the memory on the wrong device gets reserved inside _verify_param_shape_across_processes.
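For reference, roughly where the call goes now (a sketch, not my exact code; the placeholder model and torchrun-style LOCAL_RANK handling are assumptions):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ["LOCAL_RANK"])  # assumes a torchrun-style launcher

# Bind this process to its own GPU before any CUDA work, so its CUDA
# context is created on that device instead of on cuda:0.
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl")

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model; lands on the device set above
ddp_model = DDP(model, device_ids=[local_rank])  # no extra memory on GPU 0 anymore

dist.destroy_process_group()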