Multi-GPU training on COCO

I am trying to train on the COCO dataset on a 3-GPU system. I put the model into torch.nn.DataParallel, but the DataLoader does not spread the data across the GPUs and only the first GPU is used!

When I load all the data into a Python list beforehand and then read from that list during the epochs, all 3 GPUs work fine! It seems to be a problem with the DataLoader (or the CocoDetection class), but I don't have a clue.

What am I missing here to make use of all the GPUs?

nn.DataParallel will clone the model onto all specified devices, split the input batch in dim0, and send each chunk to the corresponding device. It won't do anything with your data loading. A good description is given in this blogpost. If you want to load a subset of the data for each device, use DistributedDataParallel with a DistributedSampler, which would also yield a larger speedup.
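For reference, a minimal sketch of such a DDP + DistributedSampler setup (one process per GPU; the dataset, model, collate_fn, and the TCP init address are placeholders, not the code from this thread):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(rank, world_size, dataset, model, collate_fn):
    # one process per GPU; rank also serves as the device index here
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = model.cuda(rank)
    model = DDP(model, device_ids=[rank])

    # each rank only loads its own shard of the dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler,
                        collate_fn=collate_fn)

    for epoch in range(10):  # placeholder epoch count
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, targets in loader:
            images = [img.cuda(rank) for img in images]
            targets = [{k: v.cuda(rank) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)
            ...

# launched with e.g.:
# mp.spawn(train, args=(world_size, dataset, model, collate_fn), nprocs=world_size)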

Thank you ptrblck. Splitting the batch and sending the chunks to multiple devices is all that I need, but it doesn't do that. In each epoch I have a loop like below:

for images, targets in train_loader:
    images = list(torchvision.transforms.ToTensor()(image).to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    loss_dict = model(images, targets)
    ...

After loading each batch I send it to the CUDA device, but only the first GPU is utilized. If instead of train_loader I use a list (already filled with the data), this code works with all GPUs.

Could you post a minimal, executable code snippet using random input tensors, which would show this behavior, please?

Sorry, I don't know how to reproduce it with random tensors, but this is exactly what I did (it's just the training part). It doesn't give any error and runs smoothly, but with just one GPU:

device = torch.device("cuda")
device_ids = [i for i in range(torch.cuda.device_count())]

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                   256,
                                                   num_classes)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

model = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
model.to(device)
model = model.cuda()

train_data = get_coco(path_to_data, "train", None)

train_loader = torch.utils.data.DataLoader(train_data,
                                           batch_size=batch_size,
                                           num_workers=num_workers,
                                           collate_fn=collate_fn)

for epoch in range(num_epochs):

    model.train()
    bj = 0

    for images, targets in train_loader:

        images = list(torchvision.transforms.ToTensor()(image).to(device) for image in images)

        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())

        # reduce losses over all GPUs for logging purposes
        loss_dict_reduced = utils.reduce_dict(loss_dict)
        losses_reduced = sum(loss for loss in loss_dict_reduced.values())

        loss_value = losses_reduced.item()

        if not math.isfinite(loss_value):
            print("Loss is {}, stopping training".format(loss_value))
            print(loss_dict_reduced)
            sys.exit(1)

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

Thanks.

Thanks for the code. I don't see any obvious issues and cannot reproduce the problem in my setup, as all 3 GPUs are being used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   37C    P0    95W / 400W |   6416MiB / 40536MiB |     30%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   36C    P0    83W / 400W |   6712MiB / 40536MiB |     31%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   34C    P0    85W / 400W |   6568MiB / 40536MiB |     34%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

One thing that is strange to me: the code does engage all GPUs, since their memory usage goes up a little, but not nearly enough for them to actually be doing the work.
When I train on a single GPU I set batch_size=2, so with 3 GPUs I should be able to set batch_size=6. But if I set it to 6, the code gives me a GPU OOM error, which suggests that only one GPU is handling the data.
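One way to verify whether the batch is really being split would be to print the allocated memory on every device right after a forward pass, e.g. (a rough sketch meant to go inside the training loop above):

loss_dict = model(images, targets)

# rough sketch: check how much memory each GPU actually holds after the forward pass
for i in range(torch.cuda.device_count()):
    allocated_mb = torch.cuda.memory_allocated(i) / 1024**2
    print("cuda:{} allocated: {:.1f} MiB".format(i, allocated_mb))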

[I don't know if it is related or not, but I am running the code in a Docker container, with `--gpus all` of course.]

Here is the output of my nvidia-smi: