PyTorch DistributedDataParallel training

code: GNGAN-PyTorch/train_ddp.py at master · basiclab/GNGAN-PyTorch · GitHub

train_ddp.py is optimized for multi-GPU training, e.g.,
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_ddp.py --flagfile ./config/GN-GAN_CELEBAHQ256_RES.txt

Colab seems to provide only one GPU, but that is not really the point: whether you list four devices as above or use just CUDA_VISIBLE_DEVICES=0, the run fails either way.
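For context, a launcher like train_ddp.py presumably follows the standard torch.distributed pattern: spawn one worker process per visible GPU and have every worker join the same process group. A minimal sketch of that pattern (my assumption of what the script does, not the repo's actual code; port 55556 matches the error below):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, port):
    # All workers must agree on the rendezvous address and port.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # fails with "invalid device ordinal" if rank >= number of GPUs
    # ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 1 on a standard Colab runtime
    mp.spawn(worker, args=(world_size, 55556), nprocs=world_size)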

Error list:
The server socket has failed to bind to [::]:55556 (errno: 98 - Address already in use).
Process Process-2:
Process Process-3:
Process Process-4:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=10, timeout=0:00:30)
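The "Address already in use" part usually means the rendezvous port (55556 here) is still bound, typically by worker processes left over from a previous crashed run, and worker_count=10 being larger than world_size=4 in the barrier timeout likewise suggests stale workers are still registered with the store. Restarting the Colab runtime (or killing leftover python processes) plus picking an unused port is the usual workaround. A small sketch of picking a free port, assuming the script honors the MASTER_PORT environment variable (an assumption I have not verified against the repo):

import os
import socket

def find_free_port():
    # Ask the OS for any currently unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

os.environ["MASTER_PORT"] = str(find_free_port())  # assumption: train_ddp.py reads MASTER_PORT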

Or the failure looks like this:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
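This second error is what CUDA raises when code asks for a device index that does not exist, e.g., rank 1, 2, or 3 calling torch.cuda.set_device(rank) on a runtime with a single GPU. A quick sanity check before launching (whether train_ddp.py already guards against this I have not checked):

import torch

print(torch.cuda.device_count())                # 1 on a standard Colab runtime
world_size = min(4, torch.cuda.device_count())  # never request more ranks than visible GPUs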

After a lot of searching on Google, none of the suggested fixes solved the problem; it is apparently an issue with the parallel (DDP) training itself. Since Google Colab compatibility should be the best, I have been debugging on Google Colab.

The following commands seemed to work before, but that was on the CIFAR-10 and STL-10 datasets. Maybe the CelebA-HQ dataset is larger, or the parallel training itself is the issue? I am not sure what to do.
!export CUDA_VISIBLE_DEVICES=0,1,2,3
!python train_ddp.py --flagfile ./config/GN-GAN_CELEBAHQ256_RES.txt
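One caveat with these two cells: every ! command in Colab runs in its own subshell, so the !export on the first line does not affect the python process launched on the second. Setting the variable inline (or with the %env magic) avoids that; with the single visible Colab GPU that would be, e.g.,

!CUDA_VISIBLE_DEVICES=0 python train_ddp.py --flagfile ./config/GN-GAN_CELEBAHQ256_RES.txt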