Insignificant speed-up with DataParallel for multi-GPU training

Hi PyTorchers,

I’m training a UNet for medical image segmentation. When using a single Tesla P100 GPU, each training epoch takes approximately 2 minutes on ~5000 images with a batch size of 16.

I tried training with 4 P100 GPUs using model = nn.DataParallel(model) and increasing the batch size to 64. This brings the epoch time down to 1 min 30 s, only about a 1.3x speed-up for 4x the GPUs. I also tried keeping the batch size at 16 and saw similar times.

This suggests that my bottleneck is somewhere else - perhaps in data loading and preprocessing. How would I go about confirming this? Broadly, my data-loading pipeline is: open images with PIL, resize, convert to grayscale tensors, apply data augmentation (horizontal flip and contrast), and normalize.

What do you see in nvidia-smi -l 1 regarding GPU utilization? If it’s significantly less than 100%, then you have a bottleneck before the GPUs, which is indeed probable.

How many workers do you use in the DataLoader? The best number varies with your configuration and preprocessing, but I’ve consistently had optimal performance with 4x the number of GPUs - that might be a good starting point…

What exactly do you mean by GPU utilization? In terms of memory, each GPU consistently uses ~9.2 GB of 12 GB during training, but the ‘Volatile GPU-Util’ value varies for each GPU between 0% and 50-100% from one second to the next.
I was using num_workers = 8, and also tried 16 and 24, but saw no significant difference in epoch times. If it helps, I’m on a remote node with 24 cores, 125 GB of memory, and the 4 GPUs mentioned above.

I was indeed speaking of the Volatile GPU-Util percentage! GPU memory is hard to fill completely without hitting out-of-memory errors, though you can push it higher by playing with the batch size - I’d say 9.2/12 GB is good usage.

Regarding the utilization, a drop to 0% every other second is not great, especially if it only reaches 50% in between…

But if increasing the workers does not help, I would investigate the loading / preprocessing (preprocessing first). Try disabling data augmentation and compare the epoch times before/after. You might also want to print the processing time for a full mini-batch as well as for only the loading part (i.e. the time until you get into the body of the for loop iterating over the data loader). This shows whether most of the time is spent loading / preprocessing data or afterwards (network, loss computation, etc.).
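A sketch of that timing idea (note that with asynchronous CUDA ops you would also want a torch.cuda.synchronize() before reading the clock for the compute part; `step_fn` here is a placeholder for your forward/backward/optimizer step):

```python
import time


def timed_epoch(loader, step_fn):
    """Split each iteration's wall time into 'waiting on the DataLoader'
    vs. 'everything else' (step_fn: forward, loss, backward, ...)."""
    data_time = 0.0
    step_time = 0.0
    end = time.perf_counter()
    for batch in loader:
        data_time += time.perf_counter() - end  # time blocked on loading
        start = time.perf_counter()
        step_fn(batch)
        step_time += time.perf_counter() - start
        end = time.perf_counter()
    print(f"loading: {data_time:.2f}s, compute: {step_time:.2f}s")
    return data_time, step_time
```

If `data_time` dominates, the bottleneck is in loading / preprocessing rather than in the network.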

Thanks for your help - I will do that! How would I go about making the data loading faster?

I managed to get the time per epoch down to 1 minute, a 2x speed-up, by removing a portion of the preprocessing. Realistically, how much of a speed-up should I expect with 4 GPUs if data loading and preprocessing are minimal?

Good to hear! I can’t say for sure what speed-up to expect, but very likely more than 2x and a bit less than 4x…

Now, for speeding up the data loading, there are many approaches depending on the use case… I’ll list a few:

  • for images, the compression algorithm (and therefore the decompression cost when reading) can make a big difference, though there’s a quality tradeoff (e.g. JPEG vs. PNG: JPEG will decode faster and also use less disk space),
  • sequential reads are much, much faster than random ones, so if you can safely disable shuffling you will get a huge speedup (but then be careful about bias, class imbalance and other aspects),
  • preprocessing takes time, as you saw, and if you can do it offline, before the training sessions, you will also save time. However, it removes randomness from the training.
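The last point (offline preprocessing) could be sketched like this, assuming the deterministic transforms can be baked into saved tensors once, so that each epoch only needs a fast torch.load (the division by 255 is just a stand-in for your real transform; paths and names are hypothetical):

```python
import os

import torch


def preprocess_offline(images, out_dir):
    """One-off pass: apply the deterministic preprocessing and save each
    result as a tensor file. The .float().div(255) step is a stand-in
    for the real resize/grayscale/normalize transform."""
    os.makedirs(out_dir, exist_ok=True)
    for i, img in enumerate(images):
        x = img.float().div(255.0)
        torch.save(x, os.path.join(out_dir, f"{i:06d}.pt"))


def load_preprocessed(out_dir, idx):
    # At training time, a Dataset's __getitem__ would just do this load
    # (plus any augmentation you still want to keep random).
    return torch.load(os.path.join(out_dir, f"{idx:06d}.pt"))
```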

There, you have some ideas about how to speed up the training. They all involve a fair amount of preprocessing work before training, but if you plan on training many times on the same data, they can save you a lot of time in the long run and make full use of the costly GPUs!


Thanks a lot for your help!
