Issue with DataParallel model

I am trying to use 2 GPUs for training.

I followed this tutorial and tested with 1 GPU, a batch size of 8, and an image size of 512, and got the output:

In Model: input size torch.Size([8, 1, 512, 512]) output size torch.Size([8, 1, 512, 512])

This is the maximum capacity of one of my GPUs. My idea is to train on 2 GPUs and increase the batch size to 16, so that each GPU gets a batch of 8.
When I run this setup, about 10 iterations run smoothly with the expected output

In Model: input size torch.Size([8, 1, 512, 512]) output size torch.Size([8, 1, 512, 512])
In Model: input size torch.Size([8, 1, 512, 512]) output size torch.Size([8, 1, 512, 512])

but then I get an error:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 7.80 GiB total capacity; 5.51 GiB already allocated; 191.31 MiB free;

This error appears when loss.backward() is called during training.
It puzzles me, since as far as I know one card should have enough VRAM to handle a batch size of 8 with an image size of 512.
Is there some extra memory overhead when using 2 GPUs that I am not aware of, which makes this less straightforward than it seems?

Just to mention, I have tested training on 2 GPUs with a batch size of 8 and an image size of 512, and everything works fine: the GPUs split the batch and each gets 4 images, with the output:

In Model: input size torch.Size([4, 1, 512, 512]) output size torch.Size([4, 1, 512, 512])
In Model: input size torch.Size([4, 1, 512, 512]) output size torch.Size([4, 1, 512, 512])

Why wouldn’t this work when I increase the batch size to 16?
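
In case it helps, this is roughly my training setup, stripped down to the relevant parts (the model, loader, loss, and optimizer are placeholders for my own):

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0")
model = nn.DataParallel(MyModel())   # placeholder model, wrapped over both GPUs
model.to(device)

for images, targets in loader:       # loader yields batches of 16
    images = images.to(device)       # DataParallel scatters 8 images per GPU
    targets = targets.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()                  # this is where the OOM is raised
    optimizer.step()
```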

Yes, nn.DataParallel usually creates a memory imbalance: the default device (GPU 0) stores all inputs and outputs and computes the loss, so its memory usage is higher than on the other GPUs.
We therefore recommend using DistributedDataParallel, which avoids this overhead and generally gives better performance than DataParallel.
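
A minimal DistributedDataParallel sketch (one process per GPU), assuming placeholder MyModel / MyDataset classes and a launch via torchrun --nproc_per_node=2 train_ddp.py:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # one process per GPU; torchrun sets LOCAL_RANK for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = MyModel().to(local_rank)              # placeholder model
    model = DDP(model, device_ids=[local_rank])

    dataset = MyDataset()                         # placeholder dataset
    sampler = DistributedSampler(dataset)
    # batch_size is per process, so 8 here gives a global batch of 16 on 2 GPUs
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)                  # reshuffle across processes
        for images, targets in loader:
            images = images.to(local_rank, non_blocking=True)
            targets = targets.to(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()                       # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each process keeps its inputs, outputs, and loss on its own GPU, so nothing accumulates on GPU 0 the way it does with DataParallel.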
