Hi,
I have a working single-GPU training script that I believe I correctly adapted to use DDP.
I tried to run it on a single-node, 4-GPU EC2 instance with 2 different launch techniques; both hang forever (1 min+) with CPU and GPU idle. What is wrong? How should DDP be launched?
python -m torch.distributed.launch --use_env train.py \
--gpu-count 4 \
--dataset . \
--cache tmp \
--height 604 \
--width 960 \
--checkpoint-dir . \
--batch 10 \
--workers 24 \
--log-freq 20 \
--prefetch 2 \
--bucket $bucket \
--eval-size 10 \
--iterations 20 \
--class-list a2d2_images/camera_lidar_semantic/class_list.json
This hangs.
python /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/distributed/launch.py \
train.py \
--gpu-count 4 \
--dataset . \
--cache tmp \
--height 604 \
--width 960 \
--checkpoint-dir . \
--batch 10 \
--workers 24 \
--log-freq 20 \
--prefetch 2 \
--bucket $bucket \
--eval-size 10 \
--iterations 20 \
--class-list a2d2_images/camera_lidar_semantic/class_list.json
This hangs too.
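For context, this is a sketch of the init pattern my script follows (names and defaults below are mine, added so the snippet runs standalone; `torch.distributed.launch` normally sets the env vars for each worker):

```python
import os
import torch.distributed as dist

# Defaults are illustrative assumptions so this sketch runs as a single
# process; in a real launch, torch.distributed.launch / torchrun sets
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE for every worker.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# "gloo" works on CPU for illustration; my script uses "nccl" on the GPUs.
dist.init_process_group(backend="gloo")
rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"rank {rank} of {world_size}")
dist.destroy_process_group()
```

`init_process_group` blocks until all `WORLD_SIZE` processes rendezvous, which is where my runs appear to sit idle.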
I strongly suggest the PyTorch team work on improving the distributed training experience. As models and datasets scale, this is a feature people will use more and more.