Method for efficiently transferring non-autograd tensors to CPU from GPU?

My code generates high-dimensional tensors on the GPU and stores them for later computation. However, the GPU is very likely to run out of memory if all of these samples stay on the device. I have seen the following solution Modifying forward/backward pass - #2 by ptrblck a few times on the forum, but it seems specific to autograd objects, where a computation graph is involved. I am not tracking any gradients.

I could call .cpu() on every sample I store, but this is extremely slow: my tests show that adding .cpu() slows the code down by up to 10x. I could implement my own cache that keeps a pre-defined number of samples on the GPU before moving them to the CPU in one go (a rough sketch of what I mean is below), but I was wondering whether PyTorch already has its own framework for this.
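For reference, here is a minimal sketch of the kind of staging cache I have in mind; the class name GpuStagingCache, the capacity parameter, and the batched flush are only illustrative, not an existing PyTorch API:

import torch

class GpuStagingCache:
    """Keep up to `capacity` samples on the GPU, then flush them to the CPU
    in a single batched copy instead of one transfer per sample."""

    def __init__(self, capacity, sample_shape, device):
        self.capacity = capacity
        self.staging = torch.empty((capacity, *sample_shape), device=device)
        self.cpu_chunks = []  # flushed batches living in host memory
        self.count = 0

    def add(self, sample):
        self.staging[self.count] = sample
        self.count += 1
        if self.count == self.capacity:
            self.flush()

    def flush(self):
        if self.count:
            # one large device-to-host copy instead of many small ones
            self.cpu_chunks.append(self.staging[:self.count].cpu())
            self.count = 0

    def all_samples(self):
        self.flush()
        return torch.cat(self.cpu_chunks, dim=0)

e.g. cache = GpuStagingCache(1024, (64, 64), "mps") and cache.add(sample) inside the generation loop.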

Thanks.

Your profiling code is most likely wrong as you might not be synchronizing the code before starting and stopping host timers. The cpu() call will implicitly synchronize your code and will thus accumulate the runtime of already scheduled kernels.

Sorry, I’m not sure I follow. The profiling is as follows:

import torch
import time

gpu = "mps"

def test_gpu(n_samples, *tensor_size):
    obj = torch.zeros((int(n_samples), *tensor_size), device=gpu)
    a = torch.randn(*tensor_size, device=gpu)
    start = time.time()
    for i in range(int(n_samples)):
        obj[i] = a.clone()
    end = time.time()

    print((end - start) / n_samples)

def test_cpu(n_samples, *tensor_size):
    obj = torch.zeros((int(n_samples), *tensor_size), device="cpu")
    a = torch.randn(*tensor_size, device=gpu)
    start = time.time()
    for i in range(int(n_samples)):
        obj[i] = a.clone().cpu()
    end = time.time()

    print((end - start) / n_samples)

test_gpu(1e6, 64, 64)
test_cpu(1e6, 64, 64)

However, the extent to which calling .cpu() slows down the code is system-dependent: on MPS the code can be 10-100x slower, while on a CUDA system calling .cpu() is only about 2x slower. But it is slower nonetheless.

Because this call implicitly synchronizes your code. Add torch.cuda.synchronize() before starting and stopping the host timers to measure the kernel execution time alone.
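For the posted benchmark that could look like the sketch below (assuming a CUDA device; on recent PyTorch versions torch.mps.synchronize() plays the same role for MPS, and the function name test_cpu_synced is only illustrative):

import time
import torch

def test_cpu_synced(n_samples, *tensor_size, device="cuda"):
    obj = torch.zeros((int(n_samples), *tensor_size), device="cpu")
    a = torch.randn(*tensor_size, device=device)

    torch.cuda.synchronize()  # wait for all previously queued kernels to finish
    start = time.time()
    for i in range(int(n_samples)):
        obj[i] = a.clone().cpu()
    torch.cuda.synchronize()  # make sure the timed kernels have actually finished
    end = time.time()

    print((end - start) / n_samples)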

@henryald Alternatively, you can use torch.profiler, which does the syncing for you :slight_smile:

For example: Model() uses GPU but backwards() doesn't - #3 by neoncube
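A minimal sketch of what that could look like for the copy loop above, assuming a CUDA device (the shapes and iteration count are just placeholders):

import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda"
obj = torch.zeros((1000, 64, 64), device="cpu")
a = torch.randn(64, 64, device=device)

# The profiler records both CPU and CUDA activity, so no manual
# synchronization or host timers are needed.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i in range(1000):
        obj[i] = a.clone().cpu()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))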