DistributedDataParallel() Hanging

I have used DDP with my Transformer model, but when I execute, init_process_group hangs.

Command used: python -m torch.distributed.launch --nnodes=1 --node_rank=1 --nproc_per_node=1 --use_env standard.py

With the above command, my goal is to run the model on a single node with a single GPU.
The system has 8 GPUs, but I would like to use just one, to verify that the DDP API is working.
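To sanity-check that DDP initializes at all, a minimal single-process sketch can help. This is not the poster's standard.py; it assumes the gloo backend (which works without GPUs) and hypothetical master address/port values that the launcher would normally export:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch / torchrun normally export these variables;
# they are set manually here (hypothetical values) so the sketch runs
# as a single standalone process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# "gloo" works on CPU-only machines; switch to "nccl" for GPU training.
dist.init_process_group(backend="gloo", init_method="env://")
initialized = dist.is_initialized()

# Wrap a tiny model just to confirm the DDP wrapper itself works.
model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)
out = ddp_model(torch.randn(3, 4))

dist.destroy_process_group()
```

If init_process_group hangs even in this single-process form, the problem is almost certainly in how the process group is being formed (rank/world-size/rendezvous), not in the model code.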

Was DDP working on this machine before, and are you able to use e.g. all 8 GPUs, or are all calls hanging?

The issue was with how the processes were launched; it is now resolved.
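For reference, one common cause of this kind of hang with the command above is passing --node_rank=1 on a single-node run (--nnodes=1): rank 0 then never starts, so init_process_group waits forever for the missing rank. Assuming that was the launch issue, a single-node, single-GPU launch would look like:

```shell
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=1 --use_env standard.py
```

With one node, --node_rank must be 0 (its default), since node ranks are zero-indexed.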