Does PyTorch `.to(device)` propagate gradients back to original device?

I am trying to understand how gradients are propagated back when using `loss.backward()`. From what I’ve seen, when you call `tensor.to('cuda')` the returned tensor is no longer a leaf, so it has no `.grad` and, in theory, the gradients should be propagated back to the original tensor on the CPU, which does not seem efficient.

So I want to know what exactly happens in this case when you call `loss.backward()` and why it is not inefficient, since I see a lot of people using `tensor.to(device)` instead of specifying `device=device` in the tensor construction. I also want to know whether using `requires_grad=True` on the GPU tensor to make it a leaf tensor would be better in this case.

Worth a try :slight_smile:

import torch

# torch.Tensor.to(device) approach
x_cpu = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x_gpu = x_cpu.to('cuda')
y_gpu = x_gpu * 2
loss = y_gpu.sum()
loss.backward()

print("Gradient of x_cpu (original tensor on CPU):", x_cpu.grad)
print("Gradient of x_gpu (tensor on GPU, not a leaf):", x_gpu.grad)

# specify device approach
x_gpu_leaf = torch.tensor([1.0, 2.0, 3.0], requires_grad=True, device='cuda')
y_gpu_leaf = x_gpu_leaf * 2
loss_leaf = y_gpu_leaf.sum()
loss_leaf.backward()

print("Gradient of x_gpu_leaf (leaf tensor on GPU):", x_gpu_leaf.grad)

DO NOT READ THE FOLLOWING COMMENTS BEFORE TRYING THE CODE ABOVE


You are correct: gradients will propagate back to the CPU tensor if you use the `.to(device)` approach. I’ve also tried to test whether there is a significant gap between the two approaches via the following code:

import torch
import time
import numpy as np

size = (10000, 10000)

# assuming that there is a CUDA device
tensor_cpu = torch.randn(size, requires_grad=True)
tensor_gpu_transfer = tensor_cpu.to("cuda")
tensor_gpu_direct = torch.randn(size, requires_grad=True, device="cuda")

transfer_time_list = []
for _ in range(1000):
    output_transfer = tensor_gpu_transfer.sum()
    start_transfer = time.time()    # this is not the proper way though, I'm using it for convenience
    output_transfer.backward()
    torch.cuda.synchronize()
    backward_time_transfer = time.time() - start_transfer
    transfer_time_list.append(backward_time_transfer)
print(
    "Transfer approach:\n"
    f"Mean: {np.mean(transfer_time_list):.6f}\n"
    f"std: {np.std(transfer_time_list):.6f}\n"
)

direct_time_list = []
for _ in range(1000):
    output_direct = tensor_gpu_direct.sum()
    start_direct = time.time()
    output_direct.backward()
    torch.cuda.synchronize()
    backward_time_direct = time.time() - start_direct
    direct_time_list.append(backward_time_direct)
print(
    "Direct approach:\n"
    f"Mean: {np.mean(direct_time_list):.6f}\n"
    f"std: {np.std(direct_time_list):.6f}\n"
)

Try it yourself, but I will post my results here:

  1. There is a significant gap between the two methods.
  2. The larger the size, the larger the gap.

When size is (1000, 1000), I have:

Transfer approach:
Mean: 0.000594
std: 0.000239

Direct approach:
Mean: 0.000037
std: 0.000187

But if I set size to be (10000, 10000), I have:

Transfer approach:
Mean: 0.134293
std: 0.001979

Direct approach:
Mean: 0.001331
std: 0.000157

So, back to your question: “why it is not inefficient?”
Unfortunately, most likely it is :frowning:
I think there might be a tradeoff between efficiency and convenience/flexibility, or it might simply be common practice. Generally speaking, `.to(device)` is basically a one-liner and will, in theory, cause the fewest bugs.
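
If you do want a leaf tensor on the GPU while still starting from a CPU tensor, one option (a minimal sketch of the `requires_grad` idea from your question) is to detach after the transfer and re-enable gradients; note that the original CPU tensor then receives no gradients at all:

import torch

x_cpu = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# transfer first, then cut the GPU copy out of the graph and make it a leaf
# that tracks its own gradients; x_cpu will no longer receive any gradients
x_gpu = x_cpu.to('cuda').detach().requires_grad_()

loss = (x_gpu * 2).sum()
loss.backward()

print(x_gpu.is_leaf)   # True
print(x_gpu.grad)      # tensor([2., 2., 2.], device='cuda:0')
print(x_cpu.grad)      # None: the backward pass stops at x_gpu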


Note:
My initial idea was that the transfer should be an asynchronous process, so the difference should be small. However, my test results tell a different story. I would be glad if anyone could dive deeper and explain how PyTorch handles these things; I’m not familiar with this part…
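
One thing that is easy to verify (a quick sketch, assuming a CUDA device) is that the kernel launches themselves are indeed asynchronous, which is also why timing with time.time() without synchronizing before the start is misleading:

import time
import torch

# the host call returns as soon as the kernel is queued,
# long before the actual computation has finished
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()           # start from an idle device

t0 = time.time()
c = a @ b                          # the launch returns almost immediately
launch_time = time.time() - t0

torch.cuda.synchronize()           # now wait for the matmul to actually finish
total_time = time.time() - t0

print(f"until the launch returned: {launch_time:.6f}s")
print(f"until the kernel finished: {total_time:.6f}s")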

Based on your code comment you already know that your profiling might be invalid, as it accumulates previously launched kernels into the first timed iterations, so you could fix this.
A full profiler timeline could also give you more information about the launched operations and their overhead.
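
As a rough sketch of what such a fix could look like (assuming a CUDA device; the absolute numbers will differ from the ones above): warm up first, reset the gradient of the leaf, and synchronize before starting the timer so that only the backward call itself is measured.

import time
import numpy as np
import torch

size = (10000, 10000)
tensor_cpu = torch.randn(size, requires_grad=True)
tensor_gpu_transfer = tensor_cpu.to("cuda")
tensor_gpu_direct = torch.randn(size, requires_grad=True, device="cuda")

def time_backward(gpu_tensor, leaf, n_warmup=10, n_iters=100):
    times = []
    for i in range(n_warmup + n_iters):
        output = gpu_tensor.sum()
        leaf.grad = None            # reset the leaf that actually receives the gradient
        torch.cuda.synchronize()    # nothing launched earlier should leak into the timing
        start = time.time()
        output.backward()
        torch.cuda.synchronize()    # wait for the backward kernels (and any device-to-host copy)
        if i >= n_warmup:
            times.append(time.time() - start)
    return np.mean(times), np.std(times)

print("Transfer approach (mean, std):", time_backward(tensor_gpu_transfer, tensor_cpu))
print("Direct approach (mean, std):", time_backward(tensor_gpu_direct, tensor_gpu_direct))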

The majority of use cases don’t compute gradients for the input tensor. Moving the input batch to the GPU inside the DataLoader loop is fine, as usually only the model parameters will receive gradients, not the input.
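
In code, the common pattern looks roughly like this (a sketch with a toy model and a stand-in loader): the batch is moved with .to(device) inside the loop and does not require gradients, so backward() stops at the model parameters:

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# toy stand-ins for a real model and DataLoader
model = nn.Linear(10, 2).to(device)
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for data, target in loader:
    # the batch is moved to the GPU here; it does not require gradients,
    # so nothing is copied back to the CPU during backward()
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

print(model.weight.grad.shape)  # the parameters receive gradients
print(data.grad)                # None: the input batch does not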

Thank you both for the answers, you both helped me clear out some ideas.

What do you mean specifically by this? Do you mean the fact that the kernels are launched in the first iteration, making it take longer than the other iterations? How could one profile the kernel-launch part separately?

I would also like to know if there is anything in the documentation or elsewhere explaining this pipeline, since it is still kind of a black box to me. For example, I don’t know exactly how the process of scheduling work on the GPU works, or the best practices for preventing bottlenecks caused by sending data back and forth between the CPU and GPU.

This actually sounds pretty obvious now, since the trainable variables are just the parameters, not the data. I guess my confusion came from the fact that the examples I saw propagated gradients back to all input tensors. Thank you.