GPU => CPU memory transfer time changes

Hi, I want to optimize the transfer time of data from GPU to CPU. I am using an RTX 3060 with an i7-10700 CPU. First I checked the bandwidth of a CUDA-to-pinned-CPU-memory transfer in C++ using the code from this blog (https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/). It is 12.33 GB/s, which is reasonable. Then I checked the transfer time of my Python code shown below.

import time

import numpy as np
import torch

sample_number = 65
n_iter = 100  # placeholder; the actual iteration count is set elsewhere in my script
imgs_display = torch.zeros([sample_number, 800, 1280], dtype=torch.uint8, device='cuda')
imgData = torch.zeros(sample_number * 800 * 1280, dtype=torch.uint8, device='cpu', pin_memory=True)
time_count = np.zeros(n_iter)
for n in range(n_iter):
    # here I hide the code for updating the imgs_display matrix based on the feedback
    torch.cuda.synchronize()
    start = time.perf_counter()
    imgData.copy_(imgs_display.flatten())
    torch.cuda.synchronize()
    time_count[n] = time.perf_counter() - start
    # here I hide the code for projecting the generated image patterns and getting the feedback signal

I notice that if I comment out the code for projecting images, the transfer time in each iteration is about 0.0053 s. The size of imgs_display is 66 MB, so the transfer speed is about 12.45 GB/s, matching the bandwidth measured in the benchmark. On the very first run (i.e., with no warmup), the value is higher at first (about 0.02 s); this initial spike disappears once I run the script again.
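For reference, a few untimed copies before the measured loop should serve as such a warmup. This is only a minimal sketch, and the choice of 10 warmup iterations is arbitrary:

# warmup: run a few untimed copies so the first-iteration overhead
# (driver/context setup, clock ramp-up) does not pollute the measurement
for _ in range(10):
    imgData.copy_(imgs_display.flatten())
torch.cuda.synchronize()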

Then, if I add the code for projecting images after the transfer, the transfer speed is much slower. To mimic this situation I use time.sleep(), as shown in the code below:

imgs_display = torch.zeros([sample_number, 800, 1280], dtype=torch.uint8, device='cuda')
imgData = torch.zeros(sample_number * 800 * 1280, dtype=torch.uint8, device='cpu', pin_memory=True)
time_count = np.zeros(n_iter)
for n in range(n_iter):
    # here I hide the code for updating the imgs_display matrix based on the feedback
    torch.cuda.synchronize()
    start = time.perf_counter()
    imgData.copy_(imgs_display.flatten())
    torch.cuda.synchronize()
    time_count[n] = time.perf_counter() - start
    time.sleep(0.02)  # sleep duration varied between 0 and 0.1 s in the experiments below

Here are the results for sleep times of 0, 0.02, and 0.05 s:

For time.sleep() durations longer than 0.1 s, the transfer time is always about 0.02 s:

Given these observations, I have the following questions:

  1. From the fact that the transfer speed depends on the sleep time between iterations, I assume this is caused by some kind of warmup, but which part is causing the problem? The temperature or something else?
  2. In the real experiment, the kernel needs to wait for the detector to get the feedback signal, and that time is not negligible (~0.1 s per iteration). The transfer time is then almost four times larger than the one without sleep time. Is there anything I can do to keep the high transfer speed measured in the benchmark? I am wondering if I should find a way to keep the GPU busy during the data acquisition time (a rough sketch of what I mean is below)…
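
To illustrate the idea: a background thread that launches a tiny dummy kernel while the main thread waits for the detector. This is only a sketch under my own assumptions (the dummy workload, the 5 ms interval, and the thread handling are placeholders, and I have not verified that it actually keeps the clocks up):

import threading

def gpu_keep_alive(stop_event, interval=0.005):
    # launch a negligible kernel on a side stream so the GPU never sits fully idle
    stream = torch.cuda.Stream()
    dummy = torch.ones(1024, device='cuda')
    while not stop_event.is_set():
        with torch.cuda.stream(stream):
            dummy.mul_(1.0)  # tiny workload, just to discourage downclocking
        time.sleep(interval)

stop_event = threading.Event()
worker = threading.Thread(target=gpu_keep_alive, args=(stop_event,), daemon=True)
worker.start()
# ... run the acquisition / transfer loop from above here ...
stop_event.set()
worker.join()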

Thank you in advance!

  1. I would guess your GPU puts itself into an idle mode to save power when no work is needed. nvidia-smi will show the current power state under Perf (a programmatic check is sketched after this list). You could activate persistence mode and lock the clocks for proper profiling, but your GPU would then of course use more power without doing actual work.

  2. Same as above - try to activate persistence mode.
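
As a quick programmatic check of the power state, something like the following should work (a minimal sketch, assuming the nvidia-ml-py / pynvml package is installed; P0 is maximum performance, higher P numbers are power-saving states):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# performance state: P0 = maximum clocks, higher numbers = power-saving states
pstate = pynvml.nvmlDeviceGetPerformanceState(handle)
mem_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
print(f"P-state: P{pstate}, memory clock: {mem_clock} MHz")

pynvml.nvmlShutdown()

Logging these values right before and right after the wait in your loop should show whether the GPU drops to a lower power state while it is idle. The clock locking I mentioned corresponds to nvidia-smi --lock-gpu-clocks, if your driver and GPU support it.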

Thank you so much for your speedy reply!

It seems that persistence mode is not available on Windows… I found this answer on a forum:

On Windows, nvidia-smi is not able to set persistence mode. Instead, you need to set your computational GPUs to TCC mode. This should be done through NVIDIA’s graphical GPU device management panel.

It’s true that nvidia-smi shows TCC/WDDM instead of Persistence Mode in my case. Unfortunately, there is no TCC support on RTX GPUs. I found a useful answer on the NVIDIA forum for those who are interested (Do all of the new RTX GPUs support TCC mode? - #9 by genifycom - CUDA Setup and Installation - NVIDIA Developer Forums).

It should be noted that “TCC” is a unique feature / restriction relevant only to Windows. Its sole feature is to allow peering in the Windows environment, where that is otherwise not available because of Windows OS restrictions. GeForce cards have always been restricted in terms of peering in Windows.
In Linux, however (any version of Linux), TCC is irrelevant. What is relevant is hardware peer-to-peer capability enabling UVA (unified virtual addressing). Up until the 2000 series, all 900 series and 1000 series GeForce cards could peer (share memory) over PCIe in Linux, which is really what this is all about.
Beginning with the 2000 series of GeForce cards, peer-to-peer cannot be achieved over PCIe. You could connect up to two cards over NVLink on those that have that interface (2070+). That restriction applies to Windows, Linux, or any other OS environment as well.

I know the question is now not just about PyTorch, but do you have any suggestions based on this situation?

Thanks for sharing the link and answers!
No, I’m not familiar enough with Windows, and according to these answers TCC doesn’t seem to be available for your RTX GPU, so you might want to try out a Linux-based OS.