Worth a try 
import torch
# torch.Tensor.to(device) approach
x_cpu = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x_gpu = x_cpu.to('cuda')
y_gpu = x_gpu * 2
loss = y_gpu.sum()
loss.backward()
print("Gradient of x_cpu (original tensor on CPU):", x_cpu.grad)
print("Gradient of x_gpu (tensor on GPU, not a leaf):", x_gpu.grad)
# specify device approach
x_gpu_leaf = torch.tensor([1.0, 2.0, 3.0], requires_grad=True, device='cuda')
y_gpu_leaf = x_gpu_leaf * 2
loss_leaf = y_gpu_leaf.sum()
loss_leaf.backward()
print("Gradient of x_gpu_leaf (leaf tensor on GPU):", x_gpu_leaf.grad)
DO NOT READ THE FOLLOWING COMMENTS BEFORE TRYING THE CODE ABOVE
You are correct: gradients will propagate back to the CPU tensor if you use the .to(device) approach.
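To see why, you can check the leaf status of the two tensors. This is just a small sketch on top of the example above (the exact grad_fn name printed may differ between PyTorch versions):
import torch

x_cpu = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x_gpu = x_cpu.to('cuda')
print(x_cpu.is_leaf)  # True  -> autograd accumulates .grad here
print(x_gpu.is_leaf)  # False -> .grad stays None unless you call retain_grad()
print(x_gpu.grad_fn)  # the copy node that routes gradients back to x_cpu
x_gpu.retain_grad()   # opt in if you also want the gradient on the GPU copy
(x_gpu * 2).sum().backward()
print(x_cpu.grad)     # the gradient lands on the CPU leaf
print(x_gpu.grad)     # and, after retain_grad(), on the GPU copy as well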
I've also tried to test whether there is a significant gap between these two approaches with the following code:
import torch
import time
import numpy as np
size = (10000, 10000)
# assuming that there is a CUDA device
tensor_cpu = torch.randn(size, requires_grad=True)
tensor_gpu_transfer = tensor_cpu.to("cuda")
tensor_gpu_direct = torch.randn(size, requires_grad=True, device="cuda")
transfer_time_list = []
for _ in range(1000):
    output_transfer = tensor_gpu_transfer.sum()
    start_transfer = time.time()  # this is not the proper way though, I'm using it for convenience
    output_transfer.backward()
    torch.cuda.synchronize()
    backward_time_transfer = time.time() - start_transfer
    transfer_time_list.append(backward_time_transfer)
print(
    "Transfer approach:\n"
    f"Mean: {np.mean(transfer_time_list):.6f}\n"
    f"std: {np.std(transfer_time_list):.6f}\n"
)
direct_time_list = []
for _ in range(1000):
    output_direct = tensor_gpu_direct.sum()
    start_direct = time.time()
    output_direct.backward()
    torch.cuda.synchronize()
    backward_time_direct = time.time() - start_direct
    direct_time_list.append(backward_time_direct)
print(
    "Direct approach:\n"
    f"Mean: {np.mean(direct_time_list):.6f}\n"
    f"std: {np.std(direct_time_list):.6f}\n"
)
Try it yourself, but I will post my results here:
- There is a significant gap between these two methods
- The larger the size is, the larger the gap
When size is (1000, 1000), I have:
Transfer approach:
Mean: 0.000594
std: 0.000239
Direct approach:
Mean: 0.000037
std: 0.000187
But if I set size to be (10000, 10000), I have:
Transfer approach:
Mean: 0.134293
std: 0.001979
Direct approach:
Mean: 0.001331
std: 0.000157
So, back to your question: “why it is not inefficient?”
Unfortunately, it most likely is inefficient.
I think there is a tradeoff between efficiency and convenience/flexibility, or it is simply common practice. Generally speaking, .to(device) is basically a one-liner and will, in theory, cause the fewest bugs.
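If you want to keep most of that convenience but still end up with a leaf tensor on the GPU, one pattern (just a sketch, not the only way) is to detach, move, and re-enable gradients, so autograd no longer routes gradients back through a CPU copy:
x_cpu = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# detach() drops the autograd link to the CPU tensor, .to('cuda') moves the data,
# and requires_grad_() turns the result into a fresh leaf living on the GPU
x_gpu_leaf = x_cpu.detach().to('cuda').requires_grad_()
print(x_gpu_leaf.is_leaf)  # True, so .grad will accumulate directly on the GPU
The tradeoff, of course, is that x_cpu no longer receives any gradient at all, which is exactly the difference the benchmark above is measuring.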
Note:
My initial idea was that the transfer should be an asynchronous process, so the difference should be small. However, my test results tell a different story. I would be glad if anyone could dive deeper and explain how PyTorch handles these things; I'm not familiar with this part.
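One possible explanation (my guess, I have not verified it inside PyTorch's autograd code): with the .to(device) approach, the backward pass has to copy the full gradient from the GPU back to the CPU so it can be accumulated into tensor_cpu.grad, and that device-to-host copy into regular (pageable) CPU memory is blocking. A rough sanity check is to time a plain device-to-host copy of the same shape and compare it with the gap measured above:
import time
import torch

size = (10000, 10000)
grad_like = torch.randn(size, device="cuda")  # stand-in tensor with the gradient's shape

torch.cuda.synchronize()
start = time.time()
grad_cpu = grad_like.to("cpu")  # plain device-to-host copy
torch.cuda.synchronize()
print(f"D2H copy of {size}: {time.time() - start:.6f} s")
If that number is close to the gap between the two approaches, the extra cost is dominated by moving the gradient back to the CPU rather than by the backward computation itself; I would still welcome a more authoritative explanation.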