Feywell
(Feywell)
September 14, 2020, 12:55pm
1
I just changed GitHub - ZPdesu/SEAN: SEAN: Image Synthesis with Semantic Region-Adaptive Normalization (CVPR 2020, Oral) to distributed training and ran into a weird problem.
There is no error when I use one GPU, but training stops without printing any error when I use multiple GPUs.
With two GPUs it stops after 13 epochs every time.
What could be the cause of such a problem?
My environment:
ubuntu: 16.04
gpu: nvidia-2080ti
cuda: 10.1 / 10.2
pytorch: 1.6.0 / 1.7.0
nccl: 2.4.8 / 2.7.6
python: 3.6 / 3.7
mrshenli
(Shen Li)
September 14, 2020, 6:54pm
2
Hey @Feywell
By “changed https://github.com/ZPdesu/SEAN to distributed training”, which distributed training API are you referring to (e.g., DistributedDataParallel, c10d, RPC)?
Could you please share the code that uses distributed APIs?
Feywell
(Feywell)
September 15, 2020, 2:14am
3
I just use DistributedDataParallel like this:
if opt.distributed:
    cudnn.benchmark = True
    opt.device = "cuda"
    torch.cuda.set_device(opt.local_rank)
    torch.distributed.init_process_group(backend="nccl",
                                         init_method="env://")
    synchronize()
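synchronize() here is just the usual small barrier helper; roughly something like this (the exact helper in my code may differ slightly):

import torch.distributed as dist

def synchronize():
    # wait until every rank reaches this point before continuing
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()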
And the model:
if opt.distributed:
    self.pix2pix_model = torch.nn.parallel.DistributedDataParallel(
        self.pix2pix_model,
        device_ids=[opt.local_rank],
        output_device=opt.local_rank,
        find_unused_parameters=True)
    self.pix2pix_model_on_one_gpu = self.pix2pix_model.module
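The training forward/backward then goes through the DDP wrapper itself, while optimizer creation uses the unwrapped module. Roughly like this (the mode argument follows the SPADE-style trainer, so treat the exact call as an assumption):

# forward/backward through the DDP wrapper so gradients are synchronized
g_losses, generated = self.pix2pix_model(data, mode='generator')

# optimizer creation uses the raw module, not the wrapper
self.optimizer_G, self.optimizer_D = \
    self.pix2pix_model_on_one_gpu.create_optimizers(opt)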
mrshenli
(Shen Li)
September 15, 2020, 3:02am
4
The initialization looks correct to me.
self.pix2pix_model_on_one_gpu = self.pix2pix_model.module
Question: why retrieve the local model from the DDP model?
With two GPUs it stops after 13 epochs every time.
You mean the program crashes without any error message? How did you launch the two DDP processes?
Feywell
(Feywell)
September 15, 2020, 4:23am
5
That line is just used to save the model; it comes from the original trainer code:
    updates the weights of the network while reporting losses
    and the latest visuals to visualize the progress in training.
    """

    def __init__(self, opt):
        self.opt = opt
        self.pix2pix_model = Pix2PixModel(opt)
        if len(opt.gpu_ids) > 0:
            self.pix2pix_model = DataParallelWithCallback(self.pix2pix_model,
                                                          device_ids=opt.gpu_ids)
            self.pix2pix_model_on_one_gpu = self.pix2pix_model.module
        else:
            self.pix2pix_model_on_one_gpu = self.pix2pix_model

        self.generated = None
        if opt.isTrain:
            self.optimizer_G, self.optimizer_D = \
                self.pix2pix_model_on_one_gpu.create_optimizers(opt)
            self.old_lr = opt.lr

    def run_generator_one_step(self, data):
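In the distributed case, saving goes through the same unwrapped handle; the idea is to let only rank 0 write the checkpoint, roughly like this (assuming the trainer keeps the original save() helper):

if not opt.distributed or torch.distributed.get_rank() == 0:
    # only one process writes the checkpoint files
    trainer.save(epoch)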
The program crashes at a different epoch depending on the number of GPUs, without any error message, but it runs fine on one GPU.
I launch it with the PyTorch launch utility:
python -m torch.distributed.launch --nproc_per_node=$NGPUS train.py
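torch.distributed.launch passes a --local_rank argument to every process it spawns, so train.py has to accept it; roughly like this (the actual option handling in the repo's option parser is simplified here):

import argparse

parser = argparse.ArgumentParser()
# filled in by torch.distributed.launch for each spawned process
parser.add_argument("--local_rank", type=int, default=0)
opt, _ = parser.parse_known_args()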