DDP script hangs forever, doesn't run

Hi,

I have a working single-GPU script that I believe I correctly adapted to use DDP.

I tried to run it on a single-node, 4-GPU EC2 instance with two different launch techniques; both hang indefinitely (over a minute) with the CPU and GPU idle. What is wrong? How should DDP be launched?

python -m torch.distributed.launch --use_env train.py \
    --gpu-count 4 \
    --dataset . \
    --cache tmp \
    --height 604 \
    --width 960 \
    --checkpoint-dir . \
    --batch 10 \
    --workers 24 \
    --log-freq 20 \
    --prefetch 2 \
    --bucket $bucket \
    --eval-size 10 \
    --iterations 20 \
    --class-list a2d2_images/camera_lidar_semantic/class_list.json

hangs

python /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/distributed/launch.py \
    train.py \
    --gpu-count 4 \
    --dataset . \
    --cache tmp \
    --height 604 \
    --width 960 \
    --checkpoint-dir . \
    --batch 10 \
    --workers 24 \
    --log-freq 20 \
    --prefetch 2 \
    --bucket $bucket \
    --eval-size 10 \
    --iterations 20 \
    --class-list a2d2_images/camera_lidar_semantic/class_list.json

hangs too.

I strongly suggest the PyTorch team work on improving the distributed training experience. As models and datasets scale, this is a feature people will use more and more.

It is very hard to root-cause your problem just by looking at the commands you have run. I suggest checking out our debugging tools described here; they might give you more context about where the job is stuck.
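
For example (assuming a bash shell), setting the logging-related environment variables from that page before re-running your command can show much more about what each rank is doing before it stalls:

export TORCH_CPP_LOG_LEVEL=INFO        # verbose logging from the C++ distributed layer
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra checks and logging for DDP collectives
export NCCL_DEBUG=INFO                 # NCCL-level logging (rank/device setup, transports)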

Also make sure you read our documentation on torchrun, which is the officially recommended way to launch distributed jobs.
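
For a single node with 4 GPUs, the torchrun equivalent would look roughly like this (a sketch that keeps your existing script arguments unchanged); --nproc_per_node tells torchrun to start one worker process per GPU:

torchrun --standalone --nproc_per_node=4 train.py \
    --gpu-count 4 \
    ...  # the rest of your existing arguments, unchanged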

The DDP default timeout is 30 minutes. Adjusting it through init_process_group would not resolve the underlying problem, but it would allow the job to fail faster instead of hanging.
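
For instance, something along these lines in your setup code (a sketch assuming the NCCL backend and the rank/world-size environment variables set by the launcher) would make a stuck rank error out after 5 minutes instead of 30:

from datetime import timedelta
import torch.distributed as dist

# The default env:// init reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set by the
# launcher; timeout caps how long process-group operations wait before raising.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))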