I measure the performance following the post below. Instead of the `while` loop, I have the forward pass, backward pass, optimizer step, and zeroing of the parameter gradients, and then I print the time each iteration takes (roughly like the sketch below).
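In case it helps, this is a minimal sketch of how I time each iteration; `model`, `optimizer`, `criterion`, and `loader` are placeholders for my actual setup:

```python
import time
import torch

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()

    torch.cuda.synchronize()   # make sure previously queued GPU work is done
    start = time.perf_counter()

    optimizer.zero_grad()                # zero the parameter gradients
    outputs = model(inputs)              # forward
    loss = criterion(outputs, targets)
    loss.backward()                      # backward
    optimizer.step()                     # optimizer step

    torch.cuda.synchronize()   # wait for this step's GPU work to finish
    print(f"iteration time: {time.perf_counter() - start:.4f}s")
```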
I can confirm that the training has not run into any problems so far, apart from being slow.
Do you think the CUDA version might be causing this? Is there any verbose/debug mode I can enable, or something that prints a message the way Apex does when there is a gradient overflow and it adjusts the loss scale?
Thanks for the help.