Time to calculate criterion / MSE

OK, found the issue.
Apparently the GPU keeps working in the background (CUDA calls return before the kernels have finished), so measuring your runtime only works correctly if you call torch.cuda.synchronize() before reading the clock.
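Here is a minimal sketch of that timing pattern, in case it helps anyone else. The helper name `timed_mse` and the tensor shapes are just placeholders I made up for illustration, and the code falls back to the CPU (where no synchronization is needed) if CUDA is not available:

```python
import time
import torch


def timed_mse(pred, target):
    """Return (loss, elapsed_seconds) for a single MSE evaluation."""
    criterion = torch.nn.MSELoss()
    on_gpu = pred.is_cuda
    if on_gpu:
        # Finish any pending GPU work so it is not counted in our measurement.
        torch.cuda.synchronize()
    start = time.perf_counter()
    loss = criterion(pred, target)
    if on_gpu:
        # Wait until the loss kernel has actually finished before stopping the clock.
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return loss, elapsed


# Hypothetical inputs; the sizes are arbitrary.
device = "cuda" if torch.cuda.is_available() else "cpu"
pred = torch.randn(1024, 1024, device=device)
target = torch.randn(1024, 1024, device=device)

loss, elapsed = timed_mse(pred, target)
print(f"MSE = {loss.item():.4f}, time = {elapsed * 1e3:.3f} ms")
```

Without the second synchronize() the timer would likely only capture the kernel launch, not the computation itself, which is why the numbers look suspiciously small.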