Training time does not decrease by half even when doubling the number of GPUs

Hello PyTorch Forum,

I am training a Transformer-based action recognition model and using Slurm for distributed processing. With 4 GPUs, one epoch took 4 hours and 40 minutes. I expected the time per epoch to drop to roughly half, about 2 hours and 20 minutes, when training with 8 GPUs. Instead, each epoch takes 3 hours and 30 minutes, which is well above half of the original time. I also doubled the CPU cores and memory along with the GPUs.

I suspect the bottleneck is the video data loader. What I don't understand is why doubling both the CPU cores and the GPUs doesn't cut the epoch time in half.

Why is this happening? What should I consider to reduce the training time?

You could profile your code using the native profiler or e.g. Nsight Systems, compare both setups, and check where the bottleneck in your training is, e.g. whether the GPUs are starving due to slow data loading or processing.
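
As a starting point, here is a minimal sketch (not your actual training loop; `model`, `loader`, and `optimizer` are placeholders) of wrapping a few steps with `torch.profiler` to see whether the GPUs spend most of their time waiting on the input pipeline:

```python
# Hypothetical sketch: profile a handful of training steps with torch.profiler
# to check whether the GPU kernels are gated by data loading.
import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

def profile_some_steps(model, loader, optimizer, device="cuda"):
    prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=prof_schedule,
        on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
        record_shapes=True,
    ) as prof:
        for step, (clips, labels) in enumerate(loader):
            if step >= 5:  # wait + warmup + active steps are enough for a trace
                break
            clips, labels = clips.to(device), labels.to(device)
            loss = torch.nn.functional.cross_entropy(model(clips), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            prof.step()  # advance the profiler schedule after each iteration
    # Long gaps between CUDA kernels, or long DataLoader spans on the CPU
    # timeline, usually point to the input pipeline rather than the GPUs.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Viewing the resulting trace in TensorBoard (or capturing an Nsight Systems timeline via `nsys profile python train.py`) makes it easy to compare the 4-GPU and 8-GPU runs side by side and see which component stopped scaling.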