Time to calculate criterion / MSE

I’ve analysed the run times in my training script, and it turns out (if I’ve done everything right) that calculating the criterion (torch.nn.MSELoss()) takes most of the time: about 10 times as long as forward().

Is there anything I can do to improve the runtime?

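OK, found the issue.
Apparently CUDA calls are asynchronous: the Python call returns while the GPU is still working in the background, so the time I attributed to the criterion presumably included GPU work queued by forward(). Measuring the runtime only works correctly if you call torch.cuda.synchronize() before reading the clock.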
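
For anyone hitting the same thing, here is a minimal sketch of timing with synchronization. The model, tensor sizes, and the helper names `sync_if_cuda` and `timed` are made up for illustration and are not from my actual training script; the sketch falls back to the CPU if no GPU is available.

```python
import time

import torch


def sync_if_cuda(device):
    """Wait until all queued CUDA kernels have finished (no-op on CPU)."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def timed(fn, device):
    """Run fn() and return (result, elapsed seconds), synchronizing around the call."""
    sync_if_cuda(device)          # make sure earlier GPU work is not counted
    start = time.perf_counter()
    result = fn()
    sync_if_cuda(device)          # wait for the kernels launched by fn()
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the real model and data.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(256, 256).to(device)
criterion = torch.nn.MSELoss()
inputs = torch.randn(1024, 256, device=device)
targets = torch.randn(1024, 256, device=device)

outputs, forward_time = timed(lambda: model(inputs), device)
loss, loss_time = timed(lambda: criterion(outputs, targets), device)
print(f"forward: {forward_time:.6f} s, loss: {loss_time:.6f} s")
```

The first call on a GPU also includes one-off startup cost, so it is probably worth running a few warm-up iterations and averaging over several repetitions before comparing the numbers.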