About DistributedDataParallel

When I trained my program on 4 GPUs with this command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 train.py
I got the error "NoneType object is not callable", raised at the line marked 175 in the snippet below.

When using nn.DataParallel(), the program runs without this error. How can I locate the problem?
The snippet of my program is:
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl', init_method="env://")

# Move each model to this process's GPU.
generator_model = VideoGenerator().cuda(args.local_rank)
discriminator_1 = Discriminator_1().cuda(args.local_rank)
discriminator_2 = Discriminator_2().cuda(args.local_rank)
discriminator_3 = Discriminator_3().cuda(args.local_rank)

# Wrap each model in DistributedDataParallel.
generator_model = torch.nn.parallel.DistributedDataParallel(generator_model, device_ids=[args.local_rank])
discriminator_1 = torch.nn.parallel.DistributedDataParallel(discriminator_1, device_ids=[args.local_rank])
discriminator_2 = torch.nn.parallel.DistributedDataParallel(discriminator_2, device_ids=[args.local_rank])
discriminator_3 = torch.nn.parallel.DistributedDataParallel(discriminator_3, device_ids=[args.local_rank])

generator_model.train()

train_sampler = …
train_loader = …

for i, batches in enumerate(train_loader):
    first_images, cut_audios = batches
    first_images = first_images.float().cuda(args.local_rank)
    cut_audios = cut_audios.float().cuda(args.local_rank)
    gen_video = generator_model(first_images, cut_audios)  # (line 175)

Hi,
You can apply the example from here to your code.
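
For readers without access to the link, here is a minimal single-model sketch in the spirit of the basic DDP example (not the exact code from the link; a toy nn.Linear model and random data are stand-ins, and the --local_rank parsing assumes the script is launched with torch.distributed.launch as in the command above):

import argparse
import torch
import torch.distributed as dist
from torch import nn, optim
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch passes --local_rank to each process.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)

    # Toy model standing in for the real network.
    model = nn.Linear(10, 1).cuda(args.local_rank)
    ddp_model = DDP(model, device_ids=[args.local_rank])
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    ddp_model.train()
    for _ in range(10):
        inputs = torch.randn(32, 10).cuda(args.local_rank)
        targets = torch.randn(32, 1).cuda(args.local_rank)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()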

Thanks for your reply. I have read the example you offered, but the error still exists in my program.

The error “NoneType object is not callable” seems to suggest that generator_model somehow becomes None? Can you verify that?
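
For example, a hypothetical check you could insert just before the failing call, reusing the names from your snippet:

# Hypothetical debugging check, placed right before the call on line 175.
assert generator_model is not None, "generator_model is None"
print(type(generator_model))      # expected: torch.nn.parallel.DistributedDataParallel
print(callable(generator_model))  # expected: True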

Another thing that stands out is the following:

generator_model = torch.nn.parallel.DistributedDataParallel(generator_model, device_ids=[args.local_rank])
discriminator_1 = torch.nn.parallel.DistributedDataParallel(discriminator_1, device_ids=[args.local_rank])
discriminator_2 = torch.nn.parallel.DistributedDataParallel(discriminator_2, device_ids=[args.local_rank])
discriminator_3 = torch.nn.parallel.DistributedDataParallel(discriminator_3, device_ids=[args.local_rank])

Any reason for creating 4 DDP instances in one process?

To help us investigate, it would be helpful if you could share a self-contained, minimal, reproducible example.

BTW, could you please add a “distributed” tag to questions related to distributed training so that people working on that can get back to you promptly?

Thank you so much! When I created the 4 DDP instances with four different process groups (using the process_group argument), the error disappeared. Could you explain the effect of process_group? Why should we create a different process group for each DDP instance?

Sure. When you create DDP instances in the same process without specifying the process_group argument, those DDP instances share the same default process group. If you then use these DDP instances in an interleaved fashion, their collective communication can de-synchronize, which may lead to a hang or a crash (see this brief explanation of the DDP implementation). So if you would like to create multiple DDP instances in the same process for different models, you can try providing a different process group instance to each. You can create new groups using the new_group API.
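
For illustration, a minimal sketch of that pattern (toy nn.Linear models stand in for your generator and discriminators, and the --local_rank parsing assumes launch via torch.distributed.launch):

import argparse
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

# Toy stand-ins for the generator and one discriminator.
generator = nn.Linear(10, 10).cuda(args.local_rank)
discriminator = nn.Linear(10, 1).cuda(args.local_rank)

# Every rank must call new_group() the same number of times, in the same order.
all_ranks = list(range(dist.get_world_size()))
pg_gen = dist.new_group(ranks=all_ranks)
pg_disc = dist.new_group(ranks=all_ranks)

# Give each DDP wrapper its own process group so their collectives
# cannot interleave on the default group.
generator = DDP(generator, device_ids=[args.local_rank], process_group=pg_gen)
discriminator = DDP(discriminator, device_ids=[args.local_rank], process_group=pg_disc)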

Thank you sincerely! But I have another problem: torch.nn.parallel.DistributedDataParallel() problem about "NoneType Error" / CalledProcessError / backward. Could you give me some advice?

Sure, commented in that post.