When I run the official PyTorch DistributedDataParallel training demo (elastic_ddp.py from the "Initialize DDP with torch.distributed.run/torchrun" tutorial section), redundant model copies are created on GPU:0.
Are there any solutions to avoid creating these extra copies? Thanks!
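For reference, the core of the demo looks roughly like this (a condensed sketch of my understanding of the tutorial script; the loss/optimizer details are abbreviated):

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR etc., so no explicit args are needed
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

if __name__ == "__main__":
    demo_basic()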
Environment:
- PyTorch: 1.12.1
- Python: 3.9.7
- CUDA: 11.6
I launch the script with:
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 elastic_ddp.py
nvidia-smi output (redundant copies appear on GPU:0 for PIDs 2754936, 2754937, and 2754938):
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2754935 C ...opt/miniconda3/bin/python 955MiB |
| 0 N/A N/A 2754936 C ...opt/miniconda3/bin/python 653MiB |
| 0 N/A N/A 2754937 C ...opt/miniconda3/bin/python 653MiB |
| 0 N/A N/A 2754938 C ...opt/miniconda3/bin/python 653MiB |
| 1 N/A N/A 2754936 C ...opt/miniconda3/bin/python 911MiB |
| 2 N/A N/A 2754937 C ...opt/miniconda3/bin/python 1023MiB |
| 3 N/A N/A 2754938 C ...opt/miniconda3/bin/python 1007MiB |
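My guess is that each rank briefly touches GPU:0 before being pinned to its own device (e.g. during NCCL initialization), which would explain the extra ~653MiB CUDA context per process. If that is right, would explicitly binding each process to its local GPU before init_process_group be the correct fix? A minimal sketch of what I have in mind, relying on the LOCAL_RANK environment variable that torchrun sets:

import os
import torch
import torch.distributed as dist

def setup():
    # Pin this process to its own GPU *before* any CUDA work happens,
    # so no rank ever creates a context on the default device (GPU:0).
    local_rank = int(os.environ["LOCAL_RANK"])  # provided by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")
    return local_rank

I have not verified whether this is the intended approach, so any pointers would be appreciated.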