When training a 3D CNN model on a node with 4 Tesla V100 GPUs using mixed precision, I got a weird result:
The mixed-precision runs (opt levels O1 and O3) are slower than the FP32 run when the batch size is 1, measuring execution time with time.time().
Using the PyTorch profiler, however, it shows that mixed precision does speed up training in terms of both CPU time and CUDA time.
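For reference, I profiled roughly like this (a minimal CPU-only sketch with a toy Conv3d, not my actual model or shapes; on the GPU node I pass `use_cuda=True` to also record CUDA time):

```python
import torch
from torch.autograd import profiler

# Toy stand-in for the 3D CNN: a single Conv3d on a tiny 5D input
# (N, C, D, H, W). The real model and input sizes are much larger.
x = torch.randn(2, 3, 8, 8, 8)
conv = torch.nn.Conv3d(3, 4, kernel_size=3)

with profiler.profile() as prof:  # add use_cuda=True when running on GPU
    y = conv(x)

# Aggregate per-op stats, sorted by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```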
Notably, the problem only occurs when the batch size is 1 (batch size 4 is accelerated as expected), and I tried two scales of the 3D CNN model. (The large model can only be trained with batch size 1 because it is too large.)
It seems that a large portion of the execution time is not related to computation.
Do you have any idea what is going on?
Why is the total execution time of the 3D CNN in mixed precision slower than FP32 when the batch size is 1?
PyTorch versions: 1.5 and 1.3
If you are manually timing CUDA operations, you would need to synchronize the code before starting and stopping the timer via torch.cuda.synchronize().
Also, the first CUDA operation creates the CUDA context etc. and will be slower than the following calls, so you should add some warmup iterations.
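As a GPU-free analogy of why unsynchronized timing is misleading (a thread pool stands in for the asynchronous CUDA stream here; all names are made up for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel():
    # Stands in for real GPU compute
    time.sleep(0.05)

# One worker plays the role of the GPU executing queued kernels
pool = ThreadPoolExecutor(max_workers=1)

# Naive timing: the "launch" returns immediately (like an async CUDA
# kernel launch), so we measure only the enqueue cost, not the work.
start = time.time()
future = pool.submit(fake_kernel)
naive = time.time() - start

future.result()  # drain the queue before the next measurement

# Correct timing: wait for the work to finish before stopping the
# timer, analogous to calling torch.cuda.synchronize().
start = time.time()
future2 = pool.submit(fake_kernel)
future2.result()
synced = time.time() - start

print(f"naive: {naive:.4f}s  synchronized: {synced:.4f}s")
pool.shutdown()
```

The naive number is near zero while the synchronized one reflects the actual work, which is why `time.time()` without `torch.cuda.synchronize()` can rank runs incorrectly.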
I have used torch.cuda.synchronize(), and there are hundreds of iterations in one epoch, all showing the same problem.
I will provide the detailed results soon.
FP32 training without apex:
FP32 training with apex O0:
mixed precision training with apex O1:
Both the forward and backward passes seem to see a speedup between O0 and O1.
Are you seeing that the complete training time in O1 is still higher than in O0?
This is not related to quantization; can we have a mixed-precision/amp tag?
That sounds like a good idea!
My main concern is that these two profiling methods (time.time() and the profiler) produce very different results.