Excessive memory consumption while using DistributedDataParallel

Hello,

I have been using FasterRCNN from Torchvision’s models with DistributedDataParallel for training. However, I find that memory consumption is far higher when using multiple GPUs than when using a single GPU. Here is my code:

  kwargs = {}
  kwargs['min_size'] = args.min_size
  kwargs['max_size'] = args.max_size
  model = ModifiedFRCNN(cfg=cfg, custom_anchor=args.custom_anchor,
                        use_def=args.use_def, cpm=args.cpm,
                        default_filter=args.default_filter,
                        soft_nms=args.soft_nms,
                        upscale_r=args.upscale_r, **kwargs).cuda().eval()
  model = restore_network(model)
  model_without_ddp = model
  dataset = GenData(args.test_dataset,
                    args.base_path,
                    dataset_param=None,
                    train=False)

  if args.n_gpu > 1:
      init_distributed_mode(args)
      model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu],
                                                        find_unused_parameters=True)
      model_without_ddp = model.module
      sampler = torch.utils.data.distributed.DistributedSampler(dataset)
      batch_sampler = torch.utils.data.BatchSampler(sampler,
                                                    args.batch_size,
                                                    drop_last=True)
      data_loader = torch.utils.data.DataLoader(dataset,
                                                batch_sampler=batch_sampler,
                                                num_workers=args.num_workers,
                                                collate_fn=coco_collate)
      metric_logger = MetricLogger(delimiter="  ")
      header = 'Valid:'
      batch_iterator = metric_logger.log_every(data_loader, 100, header)
  else:
      model = model.cuda()
      # Build the loader once; wrap it in iter() only once below.
      data_loader = torch.utils.data.DataLoader(dataset, args.batch_size, shuffle=False,
                                                num_workers=args.num_workers,
                                                collate_fn=coco_collate)
      batch_iterator = iter(data_loader)

ModifiedFRCNN is a class that inherits from FRCNN and makes trivial changes, such as to parameters, postprocessing, etc.
Case 1: When n_gpu=1, I am able to use a batch size of up to 8.
Case 2: When n_gpu=4, I am unable to use even a batch size of 1.

Both of the above cases are on the same GPU model, a 2080Ti. Can someone please help me understand what causes this? Here is the command I use to launch the script:

python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py <other_arguments> --world_size 4 --n_gpu 4
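
In case it helps with diagnosis, here is a minimal sketch of how I understand each launched process is expected to find its own GPU when --use_env is passed. It assumes the launcher exports RANK, WORLD_SIZE and LOCAL_RANK as environment variables; `resolve_device` is a hypothetical helper written for illustration and is not part of my script or of PyTorch.

```python
import os


def resolve_device(env=None):
    """Return (rank, world_size, device string) for the current process.

    Assumption: the launcher exports RANK, WORLD_SIZE and LOCAL_RANK.
    If RANK or WORLD_SIZE is missing, fall back to a single-process setup.
    """
    env = os.environ if env is None else env
    if "RANK" in env and "WORLD_SIZE" in env:
        rank = int(env["RANK"])
        world_size = int(env["WORLD_SIZE"])
        local_rank = int(env.get("LOCAL_RANK", 0))
    else:
        rank, world_size, local_rank = 0, 1, 0
    return rank, world_size, f"cuda:{local_rank}"


if __name__ == "__main__":
    rank, world_size, device = resolve_device()
    print(f"rank {rank} of {world_size} -> {device}")
    # In a real script, the local rank would presumably be passed to
    # torch.cuda.set_device before any .cuda() call, so that tensors
    # are not placed on the default device by every process.
```

If every process reports the same device, or if the model is moved to the GPU before the device is selected, all four processes may end up allocating memory on one card, which would be worth ruling out here.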

Thank you and Regards,