I measured the performance following the post below. Inside the training loop I time the forward pass, backward pass, optimizer step, and zeroing of the parameter gradients, and then print the time each iteration takes.
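For reference, this is roughly how I take the measurement (a minimal sketch; `model`, `optimizer`, `criterion`, and `train_loader` are placeholders for my actual setup):

```python
import time
import torch

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()

    # CUDA kernels run asynchronously, so synchronize before reading the clock
    torch.cuda.synchronize()
    start = time.time()

    optimizer.zero_grad()                # zero the parameter gradients
    outputs = model(inputs)              # forward
    loss = criterion(outputs, targets)
    loss.backward()                      # backward
    optimizer.step()                     # optimizer step

    torch.cuda.synchronize()
    print(f"iteration time: {time.time() - start:.4f} s")
```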
I can confirm that the training has not run into any problems so far, apart from being slow.
Do you think the CUDA version might be causing this? Is there any verbose logging I can enable for debugging, or can it print something the way Apex does when there is a gradient overflow and it adjusts the loss scale?
Thanks for the help.