How did you measure the transfer time?
Note that CUDA calls are asynchronous, so that the data transfer and processing will be performed in the background.
If you want to time the transfer time, you should synchronize the calls with: torch.cuda.synchronize()
:
a = torch.randn(...)
torch.cuda.synchronize()
t0 = time.time()
a = a.to('cuda:0')
torch.cuda.synchronize()
t1 = time.time()