NaN loss when using DistributedDataParallel

Built from source (cuda91, pytorch 1.0, h4c16780_0).
I’m training on one machine with 2 GPUs.

I have exactly the same code running with either torch.nn.parallel.DistributedDataParallel or torch.nn.DataParallel, initialized like this:

        if not config.parallel_mode:
            self.model = self.bare_model
        elif config.parallel_mode == "distributed":
            torch.distributed.init_process_group(backend='nccl',
                                                 world_size=1, rank=0,
                                                 init_method='file://' + config.out_dir + "/shared_file")
            self.model = torch.nn.parallel.DistributedDataParallel(self.bare_model)
        else:
            self.model = torch.nn.DataParallel(self.bare_model)
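
For reference, I understand the more common single-machine pattern is one process per GPU, with world_size equal to the number of GPUs and each replica bound to a single device via device_ids. The sketch below is only illustrative (run_worker, the toy nn.Linear model, and the /tmp/ddp_shared_file path are placeholders, not my actual code); in my run above I used a single process instead:

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn

    def run_worker(rank, world_size, shared_file):
        # One process per GPU: each worker gets its own rank and device.
        dist.init_process_group(backend='nccl',
                                init_method='file://' + shared_file,
                                world_size=world_size,
                                rank=rank)
        torch.cuda.set_device(rank)
        model = nn.Linear(10, 10).cuda(rank)  # placeholder model
        # Bind the replica to one device so each process only touches its own GPU.
        model = nn.parallel.DistributedDataParallel(model,
                                                    device_ids=[rank],
                                                    output_device=rank)
        # ... training loop, typically with a DistributedSampler on the DataLoader ...

    if __name__ == '__main__':
        world_size = torch.cuda.device_count()  # 2 on this machine
        mp.spawn(run_worker, args=(world_size, '/tmp/ddp_shared_file'),
                 nprocs=world_size)

One gotcha with the file:// init method is that a stale shared file left over from a previous run can break the rendezvous, so it may be worth deleting config.out_dir + "/shared_file" before each run.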

The loss I’m using is:

nn.CrossEntropyLoss(size_average=True)
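
(As an aside, probably unrelated to the NaN: in PyTorch 1.0 the size_average argument is deprecated in favor of reduction, and mean reduction is already the default, so the equivalent call would be:)

nn.CrossEntropyLoss(reduction='mean')  # same behavior as size_average=True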

With torch.nn.DataParallel the training runs smoothly. However, with torch.nn.parallel.DistributedDataParallel (initialized as above) I get a NaN loss in the second epoch!

What could be different in DistributedDataParallel compared to DataParallel?
