I am doing video classification using optical flow. My training pipeline is a simple two-stage network:
[input video] → RAFT model (inference only) → [optical flow] → Inception 3D model → [class label]
I have access to two A100 80GB GPUs. A batch size of 32 fits in the memory of a single GPU, so I am not constrained by GPU memory and there is no strict need to use 2 GPUs. However, I notice that keeping the same total batch size of 32 but spreading it over two GPUs (so each GPU processes 16 rather than 32) and wrapping both the RAFT and the Inception 3D models in DataParallel doubles the speed. The bottleneck in my pipeline seems to be the RAFT model (it is the most time-consuming stage).
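For reference, here is a minimal sketch of how I set up the two stages. The model constructors (`build_raft_model`, `build_inception3d`) and `train_loader` are placeholders standing in for my actual RAFT checkpoint, classifier, and video DataLoader, not exact code:

```python
import torch
import torch.nn as nn

# Placeholder constructors -- substitute the real RAFT and Inception 3D models.
raft = build_raft_model()                      # hypothetical helper: frozen RAFT for flow estimation
i3d = build_inception3d(num_classes=10)        # hypothetical helper: Inception 3D classifier

device = torch.device("cuda")

# Wrap both stages in DataParallel so each forward pass splits the batch across the visible GPUs.
raft = nn.DataParallel(raft).to(device).eval() # inference only, weights frozen
i3d = nn.DataParallel(i3d).to(device).train()

optimizer = torch.optim.Adam(i3d.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for clips, labels in train_loader:             # train_loader: my video DataLoader (batch size 32)
    clips, labels = clips.to(device), labels.to(device)

    # Stage 1: optical flow with RAFT, no gradients needed.
    with torch.no_grad():
        flow = raft(clips)                     # output shape depends on the RAFT wrapper used

    # Stage 2: classify the optical-flow volume with Inception 3D.
    logits = i3d(flow)
    loss = criterion(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the two-GPU runs, DataParallel splits each batch of 32 into two chunks of 16, one per GPU, for both the RAFT forward pass and the Inception 3D forward/backward pass.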
To summarize, here are the two cases:
One A100 GPU, batch size = 32: time per epoch = 4 hours
Two A100 GPUs, batch size = 32 (16 per GPU): time per epoch = 2 hours
Any idea why the one-GPU case is twice as slow as the two-GPU case even though the total batch size is the same? What am I missing?