Built from source (cuda91, pytorch 1.0, h4c16780_0).
I'm training on one machine with 2 GPUs.
I have exactly the same code using either
torch.nn.parallel.DistributedDataParallel or torch.nn.DataParallel
Initialized like this:
if not config.parallel_mode:
    self.model = self.bare_model
elif config.parallel_mode == "distributed":
    torch.distributed.init_process_group(
        backend='nccl',
        world_size=1, rank=0,
        init_method='file://' + config.out_dir + "/shared_file")
    self.model = torch.nn.parallel.DistributedDataParallel(self.bare_model)
else:
    self.model = torch.nn.DataParallel(self.bare_model)
The loss I'm using is:
nn.CrossEntropyLoss(size_average=True)
Running with torch.nn.DataParallel, training goes smoothly. However, when using torch.nn.parallel.DistributedDataParallel (initialized as above), I get a NaN loss in the second epoch!
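To narrow down where things blow up, I check after each backward pass whether any gradient has already gone non-finite before the loss itself turns NaN. This is just a minimal CPU sketch I use for debugging (the helper name and the toy model are mine, not from the training code above):

```python
import torch

def has_bad_grads(model):
    # Return True if any parameter gradient contains NaN or Inf.
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return True
    return False

# Toy usage: a tiny linear model and one backward pass on CPU.
model = torch.nn.Linear(4, 2)
out = model(torch.randn(3, 4))
loss = torch.nn.functional.cross_entropy(out, torch.tensor([0, 1, 0]))
loss.backward()
print(has_bad_grads(model))  # a healthy step should print False
```

Calling this right after `loss.backward()` in the real loop tells you whether the NaN originates in the gradients (e.g. from gradient averaging across replicas) or only appears later in the forward pass.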
what can be different in DistributedDataParallel from DataParallel?