I’m training a UNet for medical image segmentation. On a single Tesla P100 GPU, each training epoch takes approximately 2 minutes on ~5000 images with a batch size of 16.
I tried training with 4 P100 GPUs using `model = nn.DataParallel(model)` and increasing the batch size to 64. This brings the epoch time down to only 1 min 30 s, which is nowhere near the ~4x speedup I would expect from 4 GPUs. Keeping the batch size at 16 gave similar times.
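For context, the multi-GPU training loop is essentially the following (`UNet` and `train_dataset` stand in for my actual model and dataset; the loss and optimizer are simplified):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = torch.device("cuda")

model = UNet()                  # my segmentation model (definition omitted)
model = nn.DataParallel(model)  # replicates the model across all 4 visible P100s
model = model.to(device)

# batch_size=64 so that DataParallel scatters 16 samples to each of the 4 GPUs
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, masks in loader:
    images, masks = images.to(device), masks.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), masks)  # forward pass is split across GPUs
    loss.backward()
    optimizer.step()
```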
This suggests that the bottleneck is somewhere else, most likely in my data loading and preprocessing. How would I go about confirming that? Broadly, my dataloading process is: open images with PIL, resize, convert to grayscale tensors, apply data augmentation (horizontal flip and contrast), and normalize.
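In code, the per-image pipeline looks roughly like this (the resize target, contrast range, and normalization stats below are placeholders, not my exact values):

```python
from PIL import Image
import torchvision.transforms as T

# Simplified version of my per-image preprocessing
transform = T.Compose([
    T.Resize((256, 256)),                # resize
    T.Grayscale(num_output_channels=1),  # convert to grayscale
    T.RandomHorizontalFlip(p=0.5),       # augmentation: horizontal flip
    T.ColorJitter(contrast=0.2),         # augmentation: random contrast
    T.ToTensor(),                        # PIL image -> float tensor in [0, 1]
    T.Normalize(mean=[0.5], std=[0.5]),  # normalize the single channel
])

img = Image.open("path/to/image.png")    # placeholder path
x = transform(img)                       # tensor of shape [1, 256, 256]
```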
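The only check I have thought of so far is timing the DataLoader in isolation, without any GPU work, along these lines (`train_dataset` again stands in for my dataset; the worker count is a guess):

```python
import time
from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)

# Iterate over the loader alone, never touching the GPU
n_batches = 50
start = time.time()
for i, batch in enumerate(loader):
    if i + 1 == n_batches:
        break
elapsed = time.time() - start
print(f"data loading alone: {elapsed / n_batches:.3f} s/batch")
```

Is this a reasonable check, or is there a more systematic way to profile where the epoch time actually goes?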