Hi, I have a huge model and a large image dataset, so I’m trying to combine model parallelism with DDP, as in part 3 of this tutorial.
However, when I ran the tutorial to try DDP with the NCCL backend, I hit the same problem as in this post:
NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
That discussion arrived at this solution:
net.to(f'cuda:{args.local_rank}')
But I don’t know where to put this line when using the spawn function from the tutorial. Can anyone provide sample code?
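To make the question concrete, here is a minimal sketch of where the device placement might go when using `mp.spawn`, assuming one machine, two processes, and two GPUs per process. The helper `devices_for_rank` and the tiny `ToyMpModel` are hypothetical names used only for illustration, and the port number is an arbitrary choice. With `mp.spawn`, the process index passed as the first argument plays the role of `args.local_rank` on a single machine.

```python
import os


def devices_for_rank(rank, gpus_per_process=2):
    """Return the GPU indices owned by one process (hypothetical helper).

    Each rank gets its own contiguous block of devices, so no two ranks
    ever share a GPU.
    """
    first = rank * gpus_per_process
    return tuple(range(first, first + gpus_per_process))


def demo_model_parallel(rank, world_size):
    # torch is imported inside the worker only so that the helper above
    # can be checked on a machine without PyTorch installed.
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")  # arbitrary free port (assumption)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    dev0, dev1 = devices_for_rank(rank)
    torch.cuda.set_device(dev0)  # make dev0 this process's default CUDA device

    class ToyMpModel(nn.Module):  # hypothetical model split across two GPUs
        def __init__(self):
            super().__init__()
            self.net1 = nn.Linear(10, 10).to(dev0)
            self.net2 = nn.Linear(10, 5).to(dev1)

        def forward(self, x):
            x = torch.relu(self.net1(x.to(dev0)))
            return self.net2(x.to(dev1))

    # For a module that spans several devices, device_ids is left unset.
    ddp_model = DDP(ToyMpModel())
    out = ddp_model(torch.randn(20, 10))
    out.sum().backward()

    dist.destroy_process_group()


def run_demo(world_size):
    import torch.multiprocessing as mp

    # mp.spawn calls demo_model_parallel(i, world_size) for i in 0..nprocs-1.
    mp.spawn(demo_model_parallel, args=(world_size,), nprocs=world_size, join=True)
```

Calling `run_demo(2)` under `if __name__ == "__main__":` would start two processes, each owning two GPUs. For the plain one-GPU-per-process case, the counterpart of `net.to(f'cuda:{args.local_rank}')` would be `model.to(rank)` followed by `DDP(model, device_ids=[rank])`. Since the listing below also shows a DGX Display device, it may be worth restricting the visible devices to the A100s with `CUDA_VISIBLE_DEVICES`. That is a guess about this machine, and the index CUDA assigns to a device can differ from the one nvidia-smi prints.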
Another question is how to use the cleanup function (shown below) correctly:
def cleanup():
    dist.destroy_process_group()
Should cleanup be called after each epoch, after each batch, or just once at the very end of the program?
I’m using a single machine with 4 GPUs (A100*4):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Graphics Device On | 00000000:01:00.0 Off | 0 |
| N/A 40C P0 65W / 275W | 0MiB / 81252MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 Graphics Device On | 00000000:47:00.0 Off | 0 |
| N/A 39C P0 66W / 275W | 0MiB / 81252MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 Graphics Device On | 00000000:81:00.0 Off | 0 |
| N/A 39C P0 66W / 275W | 0MiB / 81252MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 DGX Display On | 00000000:C1:00.0 On | N/A |
| 33% 48C P8 N/A / 50W | 641MiB / 3911MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Graphics Device On | 00000000:C2:00.0 Off | 0 |
| N/A 39C P0 62W / 275W | 0MiB / 81252MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
Python version: 3.9.12
PyTorch version: 1.10.2
torchvision version: 0.11.3
I’m a beginner in multi-GPU training and DDP, so any suggestions or advice would be very helpful.
Much appreciated.