About DistributedDataParallel

When I trained my program on 4 GPUs with this command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 train.py
I got the error "NoneType object is not callable", raised at the line marked 175 in the snippet below.

When using nn.DataParallel(), the program runs without this error. How can I locate the problem?
The snippet of my program is:
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl', init_method="env://")

# Move each model to this process's GPU.
generator_model = VideoGenerator().cuda(args.local_rank)
discriminator_1 = Discriminator_1().cuda(args.local_rank)
discriminator_2 = Discriminator_2().cuda(args.local_rank)
discriminator_3 = Discriminator_3().cuda(args.local_rank)

# Wrap each model in DistributedDataParallel.
generator_model = torch.nn.parallel.DistributedDataParallel(generator_model, device_ids=[args.local_rank])
discriminator_1 = torch.nn.parallel.DistributedDataParallel(discriminator_1, device_ids=[args.local_rank])
discriminator_2 = torch.nn.parallel.DistributedDataParallel(discriminator_2, device_ids=[args.local_rank])
discriminator_3 = torch.nn.parallel.DistributedDataParallel(discriminator_3, device_ids=[args.local_rank])

generator_model.train()

train_sampler = …
train_loader = …

for i, batches in enumerate(train_loader):
    first_images, cut_audios = batches
    first_images = first_images.float().cuda(args.local_rank)
    cut_audios = cut_audios.float().cuda(args.local_rank)
    gen_video = generator_model(first_images, cut_audios)  # (line 175)

Hi,
You can apply the example from here to your code.
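
For readers without access to the link, here is a minimal single-model sketch in the spirit of the basic DDP example (not the exact code from the link; a toy nn.Linear model and random data are stand-ins, and the --local_rank parsing assumes the script is launched with torch.distributed.launch as in the command above):

import argparse
import torch
import torch.distributed as dist
from torch import nn, optim
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch passes --local_rank to each process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    # Toy model standing in for the real network.
    model = nn.Linear(10, 1).cuda(args.local_rank)
    ddp_model = DDP(model, device_ids=[args.local_rank])
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    ddp_model.train()
    for _ in range(10):
        inputs = torch.randn(32, 10).cuda(args.local_rank)
        targets = torch.randn(32, 1).cuda(args.local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()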

Thanks for your reply. I have read the example you offered, but the error still exists in my program.

The error “NoneType object is not callable” seems to suggest that generator_model somehow becomes None? Can you verify that?
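
For example, a hypothetical check you could insert just before the failing call, reusing the names from your snippet:

# Hypothetical debugging check, placed right before the call on line 175.
assert generator_model is not None, "generator_model is None"
print(type(generator_model))      # expected: torch.nn.parallel.DistributedDataParallel
print(callable(generator_model))  # expected: True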

Another thing that stands out is the following:

generator_model = torch.nn.parallel.DistributedDataParallel(generator_model, device_ids=[args.local_rank])
discriminator_1 = torch.nn.parallel.DistributedDataParallel(discriminator_1, device_ids=[args.local_rank])
discriminator_2 = torch.nn.parallel.DistributedDataParallel(discriminator_2, device_ids=[args.local_rank])
discriminator_3 = torch.nn.parallel.DistributedDataParallel(discriminator_3, device_ids=[args.local_rank])

Any reason for creating 4 DDP instances in one process?

To help us investigate, it would be helpful if you could share a self-contained, minimal, reproducible example.

BTW, could you please add a “distributed” tag to questions related to distributed training so that people working on that can get back to you promptly?

Thank you so much! When I created the 4 DDP instances with four different process groups (using the process_group argument), the error disappeared. Could you explain the effect of process_group? Why should we create a different process group for each DDP instance?

Sure. When you create DDP instances in the same process without specifying the process_group argument, those DDP instances share the same default process group. If you then use these DDP instances in an interleaved fashion, their collective communication can de-synchronize, which may lead to a hang or a crash (see this brief explanation of the DDP implementation). So if you would like to create multiple DDP instances in the same process for different models, you can try providing a different process group instance to each. You can create new groups using the new_group API.
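
For illustration, a minimal sketch of that pattern (toy nn.Linear models stand in for your generator and discriminators, and the --local_rank parsing assumes launch via torch.distributed.launch):

import argparse
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

# Toy stand-ins for the generator and one discriminator.
generator = nn.Linear(10, 10).cuda(args.local_rank)
discriminator = nn.Linear(10, 1).cuda(args.local_rank)

# Every rank must call new_group() the same number of times, in the same order.
all_ranks = list(range(dist.get_world_size()))
pg_gen = dist.new_group(ranks=all_ranks)
pg_disc = dist.new_group(ranks=all_ranks)

# Give each DDP wrapper its own process group so their collectives
# cannot interleave on the default group.
generator = DDP(generator, device_ids=[args.local_rank], process_group=pg_gen)
discriminator = DDP(discriminator, device_ids=[args.local_rank], process_group=pg_disc)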

Thank you sincerely! But I have another problem: torch.nn.parallel.DistributedDataParallel() problem about "NoneType Error" / CalledProcessError / backward. Could you give me some advice?

Sure, commented in that post.