Why is copying data to the GPU so slow?

In the training phase, sending a batch tensor to the main GPU takes about 0.07s; in the evaluation phase, it only takes 0.005s. The data size is almost the same, so I don’t know why…

Could you post the code you’ve used to time these operations?
Note that CUDA operations are asynchronous in PyTorch, so you should synchronize before starting and stopping the timer:

import time
import torch

torch.cuda.synchronize()  # wait for all previously queued GPU work
t0 = time.time()
# your operation
torch.cuda.synchronize()  # wait for the operation itself to finish
t1 = time.time()

If you don’t add these sync points, you might just be timing the initial CUDA context creation, or GPU work queued earlier in the iteration, rather than the copy itself.
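A minimal sketch of this timing pattern as a reusable helper (the `timed_copy` function and tensor shapes are illustrative, not from the original post; it falls back to CPU on machines without CUDA, where the synchronize calls are skipped):

```python
import time
import torch

def timed_copy(tensor, device):
    # Synchronize first so GPU work queued earlier isn't blamed on this copy
    # (skipped on CPU-only machines, where ops are synchronous anyway).
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    out = tensor.to(device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the copy itself to complete
    return out, time.time() - t0

device = "cuda" if torch.cuda.is_available() else "cpu"

# Warm-up: the first transfer also pays for CUDA context creation.
_ = timed_copy(torch.randn(8, 3, 224, 224), device)

# Later transfers measure only the copy itself.
_, dt = timed_copy(torch.randn(8, 3, 224, 224), device)
print(f"copy took {dt:.6f}s")
```

This also explains the training/eval asymmetry in the question: without the sync points, a timer started right after the backward pass waits for those queued kernels, inflating the apparent copy time.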
