Doing QR decomposition on GPU is much slower than on CPU

How did you measure the transfer time?
Note that CUDA calls are asynchronous, so the data transfer and processing may still be running in the background when the Python call returns.
If you want to time the transfer, synchronize before starting and before stopping the timer with torch.cuda.synchronize():

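import time
import torch

a = torch.randn(...)
torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
t0 = time.time()
a = a.to('cuda:0')
torch.cuda.synchronize()  # wait for the copy to finish before stopping the timer
t1 = time.time()
print(f'transfer time: {t1 - t0:.6f} s')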
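
The same pattern can be applied to the QR call itself, so that transfer and compute are timed separately. Here is a rough sketch, not a definitive benchmark: the helper name timed, the 512x512 size, and the warmup/repeat counts are my own assumptions, and it falls back to the CPU when no GPU is available.

```python
import time
import torch

def timed(fn, device, warmup=1, repeats=5):
    """Mean wall-clock seconds per call of fn(), synchronizing CUDA if needed.

    `timed` is a hypothetical helper for illustration, not a PyTorch API.
    """
    def sync():
        # Only CUDA work is asynchronous; CPU calls return when finished.
        if device.type == "cuda":
            torch.cuda.synchronize(device)

    for _ in range(warmup):  # warm-up calls absorb one-time setup costs
        fn()
    sync()
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()
    return (time.perf_counter() - t0) / repeats

# Assumption: use the first GPU if present, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
a_cpu = torch.randn(512, 512)  # assumed size, chosen only for the example

transfer_s = timed(lambda: a_cpu.to(device), device)  # host-to-device copy
a_dev = a_cpu.to(device)
qr_s = timed(lambda: torch.linalg.qr(a_dev), device)  # QR on the target device

print(f"transfer: {transfer_s:.6f} s, qr: {qr_s:.6f} s on {device}")
```

Running it once with device set to the GPU and once with torch.device("cpu") gives numbers you can compare directly, since both are measured after all queued work has finished.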