Curious why variable.cuda() is slower than gradient descent


I am profiling my training code to find the performance bottleneck. I found that the variable.cuda() operation takes much more time than the actual gradient descent step (74.1% vs. 13.6%).

Is there any specific reason for this?


Does anyone know?

Did you synchronize?

No, I didn’t add any synchronization to my timing code.

I mean, you should call torch.cuda.synchronize() around the timed region to get the “true” time. CUDA kernels are launched asynchronously, so without synchronizing, the elapsed time can get attributed to the wrong line of Python code.
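To make this concrete, here is a minimal sketch of a timing helper that synchronizes before and after the measured call (the helper name `timed` and the example workload are my own, not from the thread):

```python
import time

import torch


def timed(fn, sync=True):
    """Time fn(); optionally synchronize so pending CUDA work is counted."""
    if sync and torch.cuda.is_available():
        torch.cuda.synchronize()  # drain work queued before our region
    start = time.perf_counter()
    out = fn()
    if sync and torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for fn's async kernels to finish
    return out, time.perf_counter() - start


# Hypothetical usage: compare the host-to-device copy against a compute step.
x = torch.randn(1024, 1024)
if torch.cuda.is_available():
    _, copy_t = timed(lambda: x.cuda())
    print(f"copy to GPU: {copy_t:.4f}s")
```

Without the second synchronize, a kernel launch returns almost immediately and its cost silently shows up in whatever operation happens to block next, which is one way a profile can blame `.cuda()` for time spent elsewhere.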

oh, got it. Thank you!