Batch_size parameter of DataLoader when using DDP

My goal is the same as in this question.
Basically I want to train a CNN with batch size 16 and image size 256. Since one of my GPUs can only handle a batch size of 8 at image size 256, my idea is to split the work between 2 GPUs so that together they split the batch of 16 into 2 batches of 8.
I tried to do this with DataParallel model wrapping, with no success, since this approach adds extra memory overhead that my GPUs can't handle.
I switched to DistributedDataParallel and I am getting the same memory error (RuntimeError: CUDA out of memory) as soon as I try to compute predictions with y_pred = unet(x).
My way of splitting the dataset is the same as in the docs:

from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(dataset_train, num_replicas=world_size, rank=rank, shuffle=False, drop_last=False)
dataloader = DataLoader(dataset_train, batch_size=params['batch_size'], pin_memory=pin_memory, num_workers=num_workers, drop_last=False, sampler=sampler)
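
For completeness, here is a minimal sketch of the per-process initialization this snippet assumes on a single machine with one process per GPU (the setup name and the rendezvous address are placeholders of mine, not something from the docs):

import os
import torch
import torch.distributed as dist

def setup(rank, world_size):
    # Placeholder rendezvous settings for a single machine; adjust as needed.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # bind this process to its own GPU

Side note: with shuffle=False this doesn't matter, but if you ever enable shuffling, call sampler.set_epoch(epoch) at the start of each epoch, otherwise every epoch iterates the data in the same order.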

My question is: what should the batch_size parameter of the DataLoader be?
Should it be the total batch size I want split across the GPUs (in my case 16), or the batch size for each GPU (in my case 8)?

I tried to find the answer in the docs with no luck.

If you run DDP within a single server that has two GPUs, then batch_size should be the per-GPU value, i.e. 8 in your case. DDP launches one process per GPU, and each process runs its own forward/backward pass on its own batch; unlike DataParallel, no outputs are gathered on the default device. Gradients are averaged across the processes during the backward pass, so the effective global batch size is 8 × 2 = 16.
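
To make that concrete, here is a minimal runnable sketch, with a toy dataset and a single conv layer standing in for your U-Net (everything here is illustrative, not code from your post): each of the 2 processes creates its own DataLoader with batch_size=8, so one optimizer step consumes 16 samples globally.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size,
                            init_method="tcp://localhost:29500")
    torch.cuda.set_device(rank)
    # Toy stand-ins for the real dataset and U-Net.
    dataset_train = TensorDataset(torch.randn(64, 3, 256, 256),
                                  torch.randn(64, 1, 256, 256))
    sampler = DistributedSampler(dataset_train, num_replicas=world_size,
                                 rank=rank, shuffle=False, drop_last=False)
    # batch_size here is PER PROCESS: 8 on each GPU, 16 globally.
    dataloader = DataLoader(dataset_train, batch_size=8, sampler=sampler)
    model = DDP(nn.Conv2d(3, 1, 3, padding=1).to(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for x, y in dataloader:
        x, y = x.to(rank), y.to(rank)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # DDP averages gradients across the 2 processes here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(train, args=(2,), nprocs=2)  # 2 processes, one per GPU

The DistributedSampler hands each rank its own disjoint shard of the dataset, so the two loaders never see the same samples within an epoch.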
