Hello PyTorch Forum,
I am training a Transformer-based action recognition model and using Slurm for distributed processing. With 4 GPUs, one epoch takes 4 hours and 40 minutes, so I expected the time to drop to roughly half, about 2 hours and 20 minutes, with 8 GPUs. Instead, each epoch takes 3 hours and 30 minutes, well above the expected halving. I also doubled the CPU cores and memory along with the GPUs.
I suspect the bottleneck is in the video data loader (I'm measuring this as sketched below), but I don't understand why doubling both the CPU cores and the GPUs doesn't cut the epoch time in half.
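To check this, I plan to time how long each step spends waiting on the DataLoader versus how long the forward/backward pass takes. This is a simplified sketch, not my actual training loop; the dummy dataset and model are just stand-ins for my real video pipeline and Transformer:

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Dummy stand-ins for my real video dataset and model (hypothetical shapes).
class FakeClips(Dataset):
    def __len__(self):
        return 256
    def __getitem__(self, idx):
        return torch.randn(3, 16, 112, 112), torch.tensor(idx % 10)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 112 * 112, 10)
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(FakeClips(), batch_size=8, num_workers=4, pin_memory=True)

data_time, step_time = 0.0, 0.0
end = time.time()
for clips, labels in loader:
    data_time += time.time() - end  # time spent waiting on the loader
    clips = clips.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(clips), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()  # so the timing includes the GPU work
    step_time += time.time() - end  # loader wait + compute for this step
    end = time.time()

print(f"data loading: {data_time:.1f}s of {step_time:.1f}s total "
      f"({100 * data_time / step_time:.0f}% of each step spent waiting on data)")
```

If the data-loading fraction stays high after doubling the GPUs, that would explain why the epoch time doesn't scale.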
Why is this happening? What should I consider to reduce the training time?