Distributed training failed without errors

Feywell · September 14, 2020, 12:55pm

I just converted ZPdesu/SEAN (SEAN: Image Synthesis with Semantic Region-Adaptive Normalization, CVPR 2020, Oral; https://github.com/ZPdesu/SEAN) to distributed training.

And I ran into a weird problem.
There is no error when I use one GPU.
But training stops without printing any error when I use multiple GPUs.
With two GPUs, it always stops after 13 epochs.

What could be the cause of such a problem?

My environment:
ubuntu: 16.04
gpu: nvidia-2080ti
cuda: 10.1 / 10.2
pytorch: 1.6.0 / 1.7.0
nccl: 2.4.8 / 2.7.6
python: 3.6 / 3.7
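
Version details like these can also be collected programmatically and pasted into a report. The following is a minimal sketch; `collect_env` is a hypothetical helper name, and the script falls back gracefully if PyTorch or NCCL information is unavailable. PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller summary.

```python
import platform


def collect_env():
    """Return a dict of version information useful in a bug report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "pytorch": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report what we have.
        return info

    info["pytorch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    info["gpu_count"] = torch.cuda.device_count()
    try:
        import torch.cuda.nccl
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        # NCCL is not available in every build (e.g. CPU-only installs).
        info["nccl"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print("{}: {}".format(key, value))
```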

mrshenli · September 14, 2020, 6:54pm

Hey @Feywell

By “change https://github.com/ZPdesu/SEAN distributed training”, which distributed training API are you referring to (e.g., DistributedDataParallel, c10d, RPC)?

Could you please share the code that uses distributed APIs?

Feywell · September 15, 2020, 2:14am

I just use DistributedDataParallel like this:

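    if opt.distributed:
        cudnn.benchmark = True
        opt.device = "cuda"

        # Bind this process to its own GPU before creating the process group.
        torch.cuda.set_device(opt.local_rank)
        torch.distributed.init_process_group(backend="nccl",
                                             init_method="env://")

        # synchronize() is a user-defined helper (not shown here).
        synchronize()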

And the model:

    if opt.distributed:
        self.pix2pix_model = torch.nn.parallel.DistributedDataParallel(
            self.pix2pix_model,
            device_ids=[opt.local_rank],
            output_device=opt.local_rank,
            find_unused_parameters=True)
        self.pix2pix_model_on_one_gpu = self.pix2pix_model.module
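
The `synchronize()` helper called during initialization is not shown in the post. A common pattern for such a helper is a thin wrapper around `torch.distributed.barrier()`; the sketch below assumes that is what is meant here and is not the poster's actual code. The boolean return value is added only for illustration.

```python
def synchronize():
    """Block until all processes reach this point.

    Returns True if a barrier was actually executed, False if it was
    skipped (no distributed setup, or only a single process).
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # Assumption for this sketch: without PyTorch there is nothing to sync.
        return False

    if not dist.is_available() or not dist.is_initialized():
        return False
    if dist.get_world_size() == 1:
        return False

    # Every rank must call barrier(); if one rank skips it, the others hang.
    dist.barrier()
    return True
```

Because a barrier is a collective call, it only returns once every rank has reached it, so any code path that lets one rank skip the call can make the remaining ranks wait indefinitely.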

mrshenli · September 15, 2020, 3:02am

The initialization looks correct to me.

self.pix2pix_model_on_one_gpu = self.pix2pix_model.module

Question: why retrieve the local model from the DDP model?

With two GPUs, it always stops after 13 epochs.

You mean the program crashes without any error message? How did you launch the two DDP processes?
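
When a process dies without a Python exception (for example from a segmentation fault in native code), Python's built-in `faulthandler` module can still write a traceback. A minimal sketch follows; the helper name `enable_crash_diagnostics` and the log file naming are assumptions for illustration, and the rank is read from the `RANK` environment variable that the `env://` init method expects.

```python
import faulthandler
import os


def enable_crash_diagnostics(log_dir=".", rank=None):
    """Write Python tracebacks to a per-rank file on fatal signals."""
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "faulthandler_rank{}.log".format(rank))

    # The file must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at the top of the training script, before the process group is created, gives each rank its own file to inspect after a silent exit.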

Feywell · September 15, 2020, 4:23am

This line is only used to save the model:

https://github.com/ZPdesu/SEAN/blob/04c7536ff3fecd2d1a09c9ae046a1144636033a5/trainers/pix2pix_trainer.py#L23

    updates the weights of the network while reporting losses
    and the latest visuals to visualize the progress in training.
    """

    def __init__(self, opt):
        self.opt = opt
        self.pix2pix_model = Pix2PixModel(opt)
        if len(opt.gpu_ids) > 0:
            self.pix2pix_model = DataParallelWithCallback(self.pix2pix_model,
                                                          device_ids=opt.gpu_ids)
            self.pix2pix_model_on_one_gpu = self.pix2pix_model.module
        else:
            self.pix2pix_model_on_one_gpu = self.pix2pix_model

        self.generated = None
        if opt.isTrain:
            self.optimizer_G, self.optimizer_D = \
                self.pix2pix_model_on_one_gpu.create_optimizers(opt)
            self.old_lr = opt.lr

    def run_generator_one_step(self, data):

The program crashes at different epochs depending on the number of GPUs, without any error message.
But it works fine on one GPU.
I use the PyTorch launch utility:

    python -m torch.distributed.launch --nproc_per_node=$NGPUS train.py
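
Since the launcher starts one process per GPU, it can help to give each process its own log file and to turn on NCCL's own logging before the process group is created. The sketch below assumes the launcher passes `--local_rank` on the command line, as the `opt.local_rank` usage above suggests; `parse_local_rank`, `setup_rank_logging`, and the `logs` directory are hypothetical names. `NCCL_DEBUG=INFO` is a standard NCCL environment variable.

```python
import argparse
import logging
import os


def parse_local_rank(argv=None):
    """Read --local_rank from the command line, ignoring other options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def setup_rank_logging(local_rank, log_dir="logs"):
    """Create a logger that writes to a file specific to this process."""
    # Ask NCCL for verbose output unless the user already chose a level.
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger("train.rank{}".format(local_rank))
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(
        os.path.join(log_dir, "rank{}.log".format(local_rank)))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Logging the epoch and iteration from every rank makes it easier to see whether one process exits early or all of them stop at the same step.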