How to Accelerate Transferring Tensors from GPU to CPU

Hello, I have written a PyTorch inference program that loads a data file and splits the data into batches. A model I pre-trained myself then processes each batch, and every processed batch is appended to one of four lists of tensors. I run the following code to re-assemble the batches:

    # Assemble batches.
    print("Assembling batches...", end="\r")
    AOutputs = torch.cat(AOutputs, dim=0).cpu()  # Re-assemble AOutputs tensor list into single tensor.
    BOutputs = torch.cat(BOutputs, dim=0).cpu()
    DOutputs = torch.cat(DOutputs, dim=0).cpu()
    EOutputs = torch.cat(EOutputs, dim=0).cpu()
    print("Assembling batches... Done.")

When I run inference on the CPU, the re-assembly step is fast, taking less than 10 seconds. When I run inference on the GPU, however, batch assembly takes significantly longer, roughly a minute. Is there any way to eliminate this delay by accelerating the GPU-to-CPU transfer?

Did you check which part of the code (the torch.cat operation or the data transfer) takes the majority of the time? Assuming you are concatenating the tensors on the GPU without any prior execution, the caching allocator might need to grow, which triggers synchronizing and expensive cudaMalloc calls. If so, you could torch.cat the data on the host before sending it to the GPU, since only a single allocation would then be performed.
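For reference, one way to separate the two costs is to synchronize around the torch.cat before timing the transfer, since CUDA kernels are launched asynchronously and host timers alone can be misleading. A minimal sketch, assuming a CUDA device is available; the shapes and list length are placeholders, not the real model outputs:

    import time
    import torch

    # Hypothetical stand-in for the real per-batch outputs: a list of tensors already on the GPU.
    outputs = [torch.rand(10, 3, 1000, 1000, device="cuda") for _ in range(4)]

    torch.cuda.synchronize()            # make sure all previous GPU work has finished
    start = time.time()
    merged = torch.cat(outputs, dim=0)  # concatenation runs on the GPU
    torch.cuda.synchronize()            # wait for the cat kernel before stopping the timer
    print(f"cat: {time.time() - start:.3f} s")

    start = time.time()
    merged_cpu = merged.cpu()           # .cpu() blocks until the device-to-host copy is done
    print(f"transfer: {time.time() - start:.3f} s")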

@ptrblck My tests show that, when running on the GPU, the long delay occurs specifically in the torch.cat call; the subsequent transfer to the CPU is quick, taking under a second. If I instead move every tensor to the CPU first and then concatenate the lists there, the GPU-to-CPU transfer takes the most time.
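For reference, the two orderings being compared are roughly the following (a minimal sketch; the shapes and list length are placeholders, not the real model outputs):

    import torch

    # Hypothetical per-batch outputs on the GPU (placeholder shapes).
    batch_outputs = [torch.rand(8, 3, 512, 512, device="cuda") for _ in range(16)]

    # Ordering 1: concatenate on the GPU, then move the single result to the CPU.
    merged_gpu_first = torch.cat(batch_outputs, dim=0).cpu()

    # Ordering 2: move every batch to the CPU first, then concatenate on the host.
    merged_cpu_first = torch.cat([b.cpu() for b in batch_outputs], dim=0)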

This is a benchmark script I drafted:

    import torch
    import time

    TestTensor = torch.rand(10, 3, 2000, 2000)  # Simulates a batch of 10 three-channel 2000x2000 images.

    # Test transfer to GPU.
    print("Testing transfer to GPU...")
    start = time.time()
    TestTensor = TestTensor.cuda()
    print(f"Transfer to GPU time: {(time.time()-start):.2f} seconds.")

    # Test transfer to CPU.
    print("Testing transfer to CPU...")
    start = time.time()
    TestTensor = TestTensor.cpu()
    print(f"Transfer to CPU time: {(time.time()-start):.2f} seconds.")

My results:

    Testing transfer to GPU...
    Transfer to GPU time: 0.13 seconds.
    Testing transfer to CPU...
    Transfer to CPU time: 0.14 seconds.

That is what I mean by the GPU->CPU and CPU->GPU transfer delay. :stopwatch: Is there a way to accelerate it? Furthermore, I ran the following script to demonstrate the speed of torch.cat on the GPU:

    import torch
    import time

    TestTensors = [torch.rand(3, 2000, 2000) for _ in range(25)]  # Simulates 25 three-channel 2000x2000 images.

    # torch.cat followed by a transfer to the GPU.
    start = time.time()
    AllTensors = torch.cat(TestTensors, dim=0).to("cuda:1")
    print("GPU:", round(time.time()-start, 2))

    # torch.cat with the result kept on the CPU.
    start = time.time()
    AllTensors = torch.cat(TestTensors, dim=0).cpu()
    print("CPU:", round(time.time()-start, 2))

Results:

    GPU: 0.53
    CPU: 0.09

So clearly, the GPU takes significantly longer to perform the torch.cat operation on a list of tensors and to receive the tensors from the CPU. Once again, is there a way to accelerate it?
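In case it is useful: a commonly suggested way to reduce both costs (the GPU-side concatenation and the device-to-host copy) is to pre-allocate a single page-locked (pinned) CPU buffer and copy each batch directly into its slice, so no large torch.cat has to run on the GPU and the copies can be issued with non_blocking=True. A minimal sketch, with placeholder shapes standing in for the real outputs:

    import torch

    # Hypothetical per-batch GPU outputs; every batch shares the same trailing shape.
    batch_outputs = [torch.rand(8, 3, 512, 512, device="cuda") for _ in range(16)]

    total_rows = sum(b.shape[0] for b in batch_outputs)

    # Pre-allocate one page-locked (pinned) CPU buffer for the assembled result.
    result = torch.empty(
        (total_rows, *batch_outputs[0].shape[1:]),
        dtype=batch_outputs[0].dtype,
        pin_memory=True,
    )

    # Copy each batch straight into its slice of the pinned buffer.
    # non_blocking=True lets the device-to-host copies be issued asynchronously.
    offset = 0
    for batch in batch_outputs:
        n = batch.shape[0]
        result[offset:offset + n].copy_(batch, non_blocking=True)
        offset += n

    torch.cuda.synchronize()  # wait for the asynchronous copies before using `result`

Whether this helps in practice depends on how large the batches are and on what else the GPU is doing, so it is worth timing against the current approach on your own data.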