Hello,
I have been using FasterRCNN from Torchvision's models with DistributedDataParallel for training. However, I find that when using multiple GPUs, the memory consumption is far higher than without multiple GPUs. Here is my code:
kwargs = {}
kwargs['min_size'] = args.min_size
kwargs['max_size'] = args.max_size
model = ModifiedFRCNN(cfg=cfg, custom_anchor=args.custom_anchor,
                      use_def=args.use_def, cpm=args.cpm,
                      default_filter=args.default_filter,
                      soft_nms=args.soft_nms,
                      upscale_r=args.upscale_r, **kwargs).cuda().eval()
model = restore_network(model)
model_without_ddp = model
dataset = GenData(args.test_dataset,
                  args.base_path,
                  dataset_param=None,
                  train=False)

if args.n_gpu > 1:
    # Multi-GPU path: wrap the model in DDP and shard the dataset across ranks
    init_distributed_mode(args)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu],
                                                      find_unused_parameters=True)
    model_without_ddp = model.module
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    batch_sampler = torch.utils.data.BatchSampler(sampler,
                                                  args.batch_size,
                                                  drop_last=True)
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_sampler=batch_sampler,
                                              num_workers=args.num_workers,
                                              collate_fn=coco_collate)
    metric_logger = MetricLogger(delimiter=" ")
    header = 'Valid:'
    batch_iterator = metric_logger.log_every(data_loader, 100, header)
else:
    # Single-GPU path: plain DataLoader, no distributed sampler
    model = model.cuda()
    data_loader = data.DataLoader(dataset, args.batch_size, shuffle=False,
                                  num_workers=args.num_workers,
                                  collate_fn=coco_collate)
    batch_iterator = iter(data_loader)
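I have not shown init_distributed_mode above; it follows the usual pattern from the torchvision reference scripts. A minimal sketch of what it does, assuming the RANK / WORLD_SIZE / LOCAL_RANK environment variables set by torch.distributed.launch with --use_env (my actual helper may differ slightly):

import os
import torch

def init_distributed_mode(args):
    # Rank info is exported as environment variables by
    # torch.distributed.launch when --use_env is passed
    args.rank = int(os.environ['RANK'])
    args.world_size = int(os.environ['WORLD_SIZE'])
    args.gpu = int(os.environ['LOCAL_RANK'])
    # Pin this process to its own GPU before joining the process group
    torch.cuda.set_device(args.gpu)
    torch.distributed.init_process_group(backend='nccl', init_method='env://',
                                         world_size=args.world_size, rank=args.rank)
    torch.distributed.barrier()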
ModifiedFRCNN is a class that inherits from FRCNN and makes only trivial changes (parameters, postprocessing, etc.); a rough outline is sketched below.
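Something along these lines (the constructor signature matches the call above; the backbone and internals here are purely illustrative, not my actual code):

import torchvision
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

class ModifiedFRCNN(torchvision.models.detection.FasterRCNN):
    def __init__(self, cfg=None, custom_anchor=False, use_def=False, cpm=False,
                 default_filter=True, soft_nms=False, upscale_r=1, **kwargs):
        # kwargs carries min_size / max_size through to FasterRCNN
        backbone = resnet_fpn_backbone('resnet50', pretrained=True)
        super().__init__(backbone, num_classes=cfg.num_classes, **kwargs)  # cfg.num_classes is a hypothetical field
        # ... trivial tweaks to anchors / postprocessing based on the flags ...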
Case 1: When n_gpu=1, I am able to use a batch size of up to 8.
Case 2: When n_gpu=4, I am unable to use even a batch size of 1.
Both of the above cases run on the same GPU model, a 2080 Ti. Can someone please help me understand what causes this? Here is the command I use to launch the script:
python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py <other_arguments> --world_size 4 --n_gpu 4
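For reference, the per-process memory could be inspected right after the model is moved to the GPU with something like the following (torch.cuda.memory_reserved requires PyTorch >= 1.4; on older versions it was memory_cached):

import torch

# Report how much CUDA memory this process holds on its assigned device
dev = torch.cuda.current_device()
print(f"device={dev} "
      f"allocated={torch.cuda.memory_allocated(dev) / 1024**2:.0f} MiB "
      f"reserved={torch.cuda.memory_reserved(dev) / 1024**2:.0f} MiB")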
Thank you and Regards,