Time to transfer a GPU tensor to the CPU with .cpu()

Hi guys,
pretty new to PyTorch here. I am running a program whose data lives on the GPU (moved there with .cuda()). I need the results on my local MacBook Pro, so I want to move them to the CPU with .cpu(). However, the call is taking a very long time, even though it is a simple tensor of shape [2048, 300, 3]. How long should .cpu() take on a CUDA tensor?

It shouldn’t take more than a few ms.
Could you post the code?
Is it just something like this:

import torch

a = torch.randn(2048, 300, 3).to('cuda')  # create a tensor and move it to the GPU
a = a.to('cpu')                            # copy it back to the CPU
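If you want to check how long just the copy takes, a rough standalone sketch (assuming a CUDA device is available) would be:

import time
import torch

a = torch.randn(2048, 300, 3, device='cuda')
torch.cuda.synchronize()   # make sure the tensor has actually been created on the GPU
start = time.time()
b = a.cpu()                # device-to-host copy of roughly 7 MB; should take milliseconds
print(time.time() - start)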

I guess it might have been some problem with the PyCharm integrated development environment… I tried to run new commands in the console, but it kept telling me that the previous command was still running. Fortunately, I had saved everything in a dictionary, so I reloaded the data, and this time the command took only a few seconds.

There is only a slight difference between my initial command:

GRU_100_Epochs = {
    'epochs': 100,
    'loss': loss_b,
    'prediction': y_pred_b,
    'target': Yp_train.cpu(),
    'inputs': Xp_train.cpu()}

and the latter case, where I modified the dictionary entries as follows:

GRU_100_Epochs['target'] = GRU_100_Epochs['target'].cpu()
GRU_100_Epochs['inputs'] = GRU_100_Epochs['inputs'].cpu()

EDIT: Xp_train and Yp_train are the CUDA tensors.
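In case it helps anyone, here is a rough sketch of how I then get the results onto the CPU-only MacBook (the file name is just a placeholder):

import torch

# move any remaining CUDA tensors in the results dict to the CPU before saving
cpu_results = {k: v.cpu() if torch.is_tensor(v) else v
               for k, v in GRU_100_Epochs.items()}
torch.save(cpu_results, 'gru_100_epochs.pt')

# on the MacBook, map_location='cpu' guards against any stray CUDA tensors
results = torch.load('gru_100_epochs.pt', map_location='cpu')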

I am experiencing a similar issue. I load an image onto the GPU, apply a model, and then transfer the result (a 4x61x114 tensor) to the CPU. The first steps take a few milliseconds, the last one around half a minute. Except for loading the model, the code is as follows:

img = io.imread(path + f).astype(np.float32) / 255                # load and normalize the image
start = time.time()
x = torch.from_numpy(img.transpose(2, 0, 1)).unsqueeze(0).cuda()  # HWC -> NCHW, move to the GPU
with torch.no_grad():
    y = model(x).squeeze().cpu()                                  # forward pass, then copy the result to the CPU
print(time.time() - start)

If I comment out the .cpu() step, it takes a few milliseconds; otherwise it takes several seconds. What is going on? What can I do about it?

Since CUDA calls are asynchronous, you are basically timing the kernel launches if you remove the .cpu() call.
If you want to time CUDA operations, you should synchronize before starting and stopping the timer:

torch.cuda.synchronize()   # wait for pending GPU work before starting the timer
start = time.time()
# your calls
torch.cuda.synchronize()   # wait for those calls to finish before stopping the timer
stop = time.time()
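Applied to your snippet, a rough sketch (assuming model and x are defined as in your post) that separates the forward pass from the transfer would be:

import time
import torch

with torch.no_grad():
    torch.cuda.synchronize()      # wait for any pending GPU work
    start = time.time()
    y = model(x)                  # forward pass; kernels are launched asynchronously
    torch.cuda.synchronize()      # wait for the forward pass to finish
    print('forward :', time.time() - start)

    start = time.time()
    y_cpu = y.squeeze().cpu()     # device-to-host copy of the small output tensor
    print('transfer:', time.time() - start)

You should see that almost all of the time is spent in the forward pass, not in the .cpu() copy.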

I see. Indeed, half a minute is what it takes for the GPU to compute, as can be seen with proper timing. Thanks!