How to prevent DistributedDataParallel with nccl backend from creating duplicate processes?

When I run the official PyTorch DistributedDataParallel training demo (“elastic_ddp.py” from the "Initialize DDP with torch.distributed.run/torchrun" tutorial), I found that redundant model copies were created on GPU 0.

Are there any solutions to avoid creating extra copies? Thanks!


Environment:

  • PyTorch: 1.12.1
  • Python: 3.9.7
  • CUDA: 11.6

Run the script:
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 elastic_ddp.py

nvidia-smi output (redundant copies were found on GPU 0, PIDs 2754936/2754937/2754938):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2754935      C   ...opt/miniconda3/bin/python      955MiB |
|    0   N/A  N/A   2754936      C   ...opt/miniconda3/bin/python      653MiB |
|    0   N/A  N/A   2754937      C   ...opt/miniconda3/bin/python      653MiB |
|    0   N/A  N/A   2754938      C   ...opt/miniconda3/bin/python      653MiB |
|    1   N/A  N/A   2754936      C   ...opt/miniconda3/bin/python      911MiB |
|    2   N/A  N/A   2754937      C   ...opt/miniconda3/bin/python     1023MiB |
|    3   N/A  N/A   2754938      C   ...opt/miniconda3/bin/python     1007MiB |

The extra processes on GPU 0 typically appear because each rank initializes its CUDA context on the default device (GPU 0) before it ever touches its assigned GPU. You could set the device via torch.cuda.set_device(rank) early in each process, and then use torch.cuda.current_device() or the rank to move the model and data to the corresponding device.
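
A minimal sketch of what that looks like in a torchrun-launched script, assuming the usual LOCAL_RANK environment variable set by torchrun and a placeholder nn.Linear model standing in for the one from elastic_ddp.py:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])

    # Pin this process to its own GPU *before* any CUDA work;
    # otherwise every rank creates a CUDA context on GPU 0.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")

    # Hypothetical toy model; replace with the model from elastic_ddp.py
    model = nn.Linear(10, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Keep the data on the same device as the model
    inputs = torch.randn(20, 10, device=local_rank)
    loss = ddp_model(inputs).sum()
    loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With set_device called before init_process_group and the first CUDA allocation, nvidia-smi should show exactly one process per GPU instead of all ranks also appearing on GPU 0.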
