When training a 3D CNN model with mixed precision on a node with 4 Tesla V100 GPUs, I got a weird result:
The mixed-precision runs (Apex AMP opt levels O1 and O3) are slower than the FP32 run when the batch size is 1, measured with time.time() around the training step.
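Since time.time() only measures host-side wall-clock time and CUDA kernel launches are asynchronous, the timer can return before the queued kernels finish unless the GPU is synchronized first. Below is a minimal sketch of how such a measurement can be done correctly (the helper name and the warmup/iteration counts are illustrative, not from my actual training script):

```python
import time
import torch

def timed_run(fn, warmup=3, iters=10):
    """Return the average wall-clock time per call of fn().

    Synchronizes the GPU (when available) before starting and
    stopping the timer, so queued kernels are included in the
    measurement instead of running after the timer has stopped.
    """
    for _ in range(warmup):          # warmup: exclude one-time setup costs
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # wait for all queued kernels
    start = time.time()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # ensure the timed kernels are done
    return (time.time() - start) / iters
```

If the FP32 and mixed-precision runs were timed without this synchronization, the comparison may not reflect actual GPU execution time.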
With torch.autograd.profiler, however, the output shows that mixed precision does speed up training in terms of both CPU time and CUDA time.
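For reference, this is roughly how I collected the profiler numbers; a minimal sketch using a stand-in Conv3d layer and input shape rather than my actual model:

```python
import torch
import torch.autograd.profiler as profiler

# stand-in for the 3D CNN; my real model and input sizes differ
model = torch.nn.Conv3d(1, 8, kernel_size=3)
x = torch.randn(1, 1, 16, 16, 16)  # batch size = 1

with profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    model(x).sum().backward()

# per-operator CPU/CUDA time summary
print(prof.key_averages().table(sort_by="cpu_time_total"))
```

Note that the profiler reports per-operator kernel time; host-side overhead between ops (kernel launches, dtype casts, Python) can still dominate the end-to-end wall-clock time at batch size 1.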
Notably, the problem only appears when the batch size is 1 (batch size = 4 is accelerated as expected), and I tried two scales of the 3D CNN model. (The large model can only be trained with batch size = 1 because it is too large.)
It seems that a large portion of the execution time is not related to computation.
Does anyone have any idea why the total execution time of the 3D CNN in mixed precision is slower than FP32 when the batch size is 1?
PyTorch versions: 1.3 and 1.5