How to prevent DistributedDataParallel with nccl backend from creating duplicate processes?

When I run the official PyTorch DistributedDataParallel training demo (“elastic_ddp.py” from the "Initialize DDP with torch.distributed.run/torchrun" tutorial), I found that redundant model copies were created on GPU 0.

Are there any solutions to avoid creating extra copies? Thanks!


Environment:

  • PyTorch: 1.12.1
  • Python: 3.9.7
  • CUDA: 11.6

Run the script:
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 elastic_ddp.py

nvidia-smi output (redundant copies were found on GPU 0, PIDs 2754936/2754937/2754938):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2754935      C   ...opt/miniconda3/bin/python      955MiB |
|    0   N/A  N/A   2754936      C   ...opt/miniconda3/bin/python      653MiB |
|    0   N/A  N/A   2754937      C   ...opt/miniconda3/bin/python      653MiB |
|    0   N/A  N/A   2754938      C   ...opt/miniconda3/bin/python      653MiB |
|    1   N/A  N/A   2754936      C   ...opt/miniconda3/bin/python      911MiB |
|    2   N/A  N/A   2754937      C   ...opt/miniconda3/bin/python     1023MiB |
|    3   N/A  N/A   2754938      C   ...opt/miniconda3/bin/python     1007MiB |

The extra processes on GPU 0 typically appear because each rank initializes its CUDA context on the default device (GPU 0) before it ever touches its assigned GPU. You could set the device via torch.cuda.set_device(rank) early in each process, and then use torch.cuda.current_device() or the rank to move the model and data to the corresponding device.
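
A minimal sketch of what that looks like in a torchrun-launched script, assuming the usual LOCAL_RANK environment variable set by torchrun and a placeholder nn.Linear model standing in for the one from elastic_ddp.py:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])

    # Pin this process to its own GPU *before* any CUDA work;
    # otherwise every rank creates a CUDA context on GPU 0.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")

    # Hypothetical toy model; replace with the model from elastic_ddp.py
    model = nn.Linear(10, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Keep the data on the same device as the model
    inputs = torch.randn(20, 10, device=local_rank)
    loss = ddp_model(inputs).sum()
    loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With set_device called before init_process_group and the first CUDA allocation, nvidia-smi should show exactly one process per GPU instead of all ranks also appearing on GPU 0.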
