Distributed training failed without errors

I just changed https://github.com/ZPdesu/SEAN to use distributed training.

And I get a weird failure.
There is no error when I use one GPU.
But training stops without printing any error when I use multiple GPUs.
With two GPUs, it always stops after 13 epochs.

What could be the cause of such a problem?

My environment:
ubuntu: 16.04
cuda: 10.1 / 10.2
pytorch: 1.6.0 / 1.7.0
nccl: 2.4.8 / 2.7.6
python: 3.6 / 3.7

Hey @Feywell

By “change https://github.com/ZPdesu/SEAN distributed training”, which distributed training API are you referring to (e.g., DistributedDataParallel, c10d, RPC)?

Could you please share the code that uses distributed APIs?

I just use DistributedDataParallel like this:

    if opt.distributed:
        cudnn.benchmark = True
        opt.device = "cuda"


And the model:

    if opt.distributed:
        self.pix2pix_model = torch.nn.parallel.DistributedDataParallel(self.pix2pix_model,
        self.pix2pix_model_on_one_gpu = self.pix2pix_model.module

The initialization looks correct to me.

self.pix2pix_model_on_one_gpu = self.pix2pix_model.module

Question: why retrieve the local model from the DDP model?

With two GPUs, it always stops after 13 epochs.

You mean the program crashes without any error message? How did you launch the two DDP processes?
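
When a multi-process job stops silently, one way to get more information is to have every process write its own diagnostics. The following is a minimal sketch, not a confirmed fix for this issue: the helper name `enable_crash_diagnostics`, the log file naming, and the 30-minute timeout are assumptions for illustration. It uses Python's standard `faulthandler` module to dump tracebacks on fatal signals or long stalls, and sets the `NCCL_DEBUG` environment variable so NCCL logs its own activity.

```python
import faulthandler
import os


def enable_crash_diagnostics(rank, log_dir=".", hang_timeout_s=1800):
    """Write per-rank tracebacks on fatal signals and on long stalls.

    Hypothetical helper: call it once at the top of the training script,
    before the process group is initialized.
    """
    # Ask NCCL for verbose logging; it must be set before NCCL starts up.
    # setdefault keeps any value already exported in the shell.
    os.environ.setdefault("NCCL_DEBUG", "INFO")

    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "faulthandler_rank{}.log".format(rank))
    log_file = open(path, "w")

    # Dump Python tracebacks of all threads on SIGSEGV, SIGABRT, and similar.
    faulthandler.enable(file=log_file, all_threads=True)
    # Also dump tracebacks every hang_timeout_s seconds, which shows where
    # each rank is sitting if the job has stalled instead of crashed.
    faulthandler.dump_traceback_later(hang_timeout_s, repeat=True, file=log_file)

    # The caller must keep this file object alive for the dumps to work.
    return log_file
```

A possible call site would be `log_file = enable_crash_diagnostics(int(os.environ.get("RANK", 0)))`, relying on the `RANK` variable that the launcher exports. Comparing the per-rank files after a stop can show whether one process died or whether all of them are waiting on each other.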

This line is only used to save the model:

The program crashes at different epochs depending on the number of GPUs, without any error message.
But it works fine on one GPU.
I launch it with PyTorch's launch utility:

    python -m torch.distributed.launch --nproc_per_node=$NGPUS train.py
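
For reference, here is a rough sketch of what a script started this way usually needs to do in each process. It is not the SEAN code and not a diagnosis of this crash: the function names `parse_launch_args`, `init_distributed`, and `wrap_model` are made up for illustration. It assumes the launcher passes `--local_rank` to each process (or exports `LOCAL_RANK` when `--use_env` is given) and that each process should be pinned to its own GPU before wrapping the model.

```python
import argparse
import os


def parse_launch_args(argv=None):
    """Read the local rank that torch.distributed.launch hands to each process."""
    parser = argparse.ArgumentParser()
    # Fall back to the LOCAL_RANK environment variable, then to 0.
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    # parse_known_args ignores the training script's own options.
    args, _unknown = parser.parse_known_args(argv)
    return args


def init_distributed(local_rank):
    """Pin this process to one GPU and join the process group."""
    # torch is imported here so the argument parsing above has no GPU dependency.
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(local_rank)
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the launcher.
    dist.init_process_group(backend="nccl", init_method="env://")
    return torch.device("cuda", local_rank)


def wrap_model(model, local_rank):
    """Move the model to this process's GPU and wrap it in DDP."""
    from torch.nn.parallel import DistributedDataParallel as DDP

    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```

In `train.py` this would be used roughly as `args = parse_launch_args()`, then `init_distributed(args.local_rank)`, then `wrap_model(model, args.local_rank)`. Passing `device_ids` explicitly keeps each process on a single GPU, which is the one-process-per-GPU layout that `--nproc_per_node=$NGPUS` implies.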