PyTorch tensor copy to GPU slow on A6000

Hello all,

We recently bought an A6000 GPU and were surprised to find that PyTorch tensor copies are much slower than what we have seen on the V100. Almost 2x slower.

We tested using the code snippet below, which simulates multiple 640x480 images being copied to GPU memory.

import torch
import time
import cv2
import numpy as np
from tqdm import tqdm

# Load the sample image and widen it to float64 (a 640x480x3 float64 array is ~7.4 MB)
image = cv2.imread("sample_image.jpg").astype(np.float64)
device = "cuda:1"
times = []
for i in tqdm(range(1000)):
    image = image + np.random.rand()  # perturb the data so each iteration copies fresh values
    start = time.time()
    tensor = torch.tensor(image).to(device)  # host tensor construction + copy to the GPU
    times.append(time.time() - start)
    
print(f"Mean CPU->GPU copy time over {len(times)} iters = {np.mean(times)}")
print(f"Median CPU->GPU copy time over {len(times)} iters = {np.median(times)}")
print(f"Total CPU->GPU copy time over {len(times)} iters = {np.sum(times)}")
print(f"Total (except 1st) CPU->GPU copy time over {len(times)} iters = {np.sum(times[1:])}")
print(f"Std CPU->GPU copy time over {len(times)} iters = {np.std(times)}")

When running on the A6000 we get:

Mean CPU->GPU copy time over 1000 iters = 0.09747219634056091
Median CPU->GPU copy time over 1000 iters = 0.0958176851272583
Total CPU->GPU copy time over 1000 iters = 97.47219634056091
Total (except 1st) CPU->GPU copy time over 1000 iters = 96.1502206325531
Std CPU->GPU copy time over 1000 iters = 0.03929998406726696

When running on the V100 we get:

Mean CPU->GPU copy time over 1000 iters = 0.0533141655921936
Median CPU->GPU copy time over 1000 iters = 0.0502018928527832
Total CPU->GPU copy time over 1000 iters = 53.3141655921936
Total (except 1st) CPU->GPU copy time over 1000 iters = 50.651899576187134
Std CPU->GPU copy time over 1000 iters = 0.08255929702503094

I have tried upgrading to the latest torch (1.13.1) on the A6000 machine, with CUDA 11.7, cuDNN 8.5, and NVIDIA driver 520.61.05.

I read in the A6000 datasheet that its memory interface is 384-bit (link), and some websites list the V100's as 4096-bit (link). Could that be the cause of the slowness? Many thanks.

Your current code is not synchronizing the device, so the profiling could report invalid values.
Adding syncs, I see a bandwidth of ~8 GB/s for the A6000 in my system and the same for a V100, which matches the results of bandwidthTest from the cuda-samples.
You could also clone this sample, run the test via ./bandwidthTest --memory pageable --mode shmoo --htod to get an estimate of the bandwidth, and compare it against your expectation given your system, the PCIe bandwidth, etc.

Your total time also looks way too long, as the code finishes in ~0.5 s in my setups.
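
As a rough sanity check (a back-of-the-envelope sketch, assuming a 640x480x3 image widened to float64 as in your snippet), 1000 pageable copies at ~8 GB/s should take on the order of one second, not ~100 s:

bytes_per_copy = 640 * 480 * 3 * 8       # one 640x480x3 float64 image ≈ 7.37 MB
bandwidth = 8e9                          # assumed ~8 GB/s pageable host-to-device
print(f"{1000 * bytes_per_copy / bandwidth:.2f} s")  # ≈ 0.92 s for 1000 copies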

Hi @ptrblck: Thanks for your insights. I ran the same benchmark on my setups and the bandwidth is indeed slower on average on my A6000 than on my V100 GPU. Can you help us understand why that might be the case?

Running on...

 Device 0: NVIDIA RTX A6000
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   1000				0.1
   2000				0.1
   3000				0.2
   4000				0.3
   5000				0.4
   6000				0.4
   7000				0.5
   8000				0.6
   9000				0.6
   10000			0.7
   11000			0.8
   12000			0.8
   13000			0.9
   14000			0.9
   15000			1.0
   16000			1.1
   17000			1.0
   18000			1.2
   19000			1.2
   20000			1.2
   22000			0.9
   24000			1.2
   26000			1.5
   28000			1.6
   30000			1.7
   32000			1.7
   34000			1.7
   36000			1.8
   38000			1.9
   40000			1.9
   42000			1.7
   44000			1.9
   46000			2.0
   48000			2.1
   50000			2.1
   60000			2.3
   70000			2.3
   80000			2.2
   90000			2.5
   100000			2.7
   200000			3.2
   300000			3.4
   400000			3.6
   500000			3.7
   600000			3.8
   700000			3.8
   800000			3.9
   900000			3.9
   1000000			3.9
   2000000			4.0
   3000000			4.1
   4000000			4.1
   5000000			4.1
   6000000			4.1
   7000000			4.2
   8000000			4.2
   9000000			4.2
   10000000			4.2
   11000000			4.2
   12000000			4.2
   13000000			4.2
   14000000			4.2
   15000000			4.2
   16000000			4.2
   18000000			4.2
   20000000			4.1
   22000000			4.1
   24000000			4.1
   26000000			4.1
   28000000			4.1
   30000000			4.1
   32000000			4.1
   36000000			4.1
   40000000			4.2
   44000000			4.1
   48000000			4.1
   52000000			4.1
   56000000			4.1
   60000000			4.1
   64000000			4.1
   68000000			4.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Below is my result on V100

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro GV100
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   1000				0.1
   2000				0.2
   3000				0.4
   4000				0.5
   5000				0.6
   6000				0.7
   7000				0.8
   8000				0.9
   9000				1.0
   10000			1.1
   11000			1.2
   12000			1.3
   13000			1.4
   14000			1.5
   15000			1.5
   16000			1.7
   17000			1.6
   18000			1.5
   19000			1.9
   20000			1.8
   22000			2.1
   24000			2.1
   26000			2.4
   28000			2.5
   30000			2.5
   32000			2.5
   34000			2.8
   36000			2.7
   38000			2.8
   40000			3.1
   42000			3.0
   44000			3.2
   46000			3.5
   48000			3.5
   50000			3.6
   60000			3.2
   70000			3.0
   80000			3.7
   90000			4.1
   100000			4.4
   200000			5.5
   300000			5.8
   400000			6.1
   500000			6.4
   600000			6.6
   700000			6.8
   800000			6.4
   900000			6.8
   1000000			7.0
   2000000			7.1
   3000000			7.2
   4000000			7.3
   5000000			7.3
   6000000			7.4
   7000000			7.4
   8000000			7.4
   9000000			7.4
   10000000			7.5
   11000000			7.5
   12000000			7.5
   13000000			7.5
   14000000			7.5
   15000000			7.5
   16000000			7.4
   18000000			7.5
   20000000			7.5
   22000000			7.5
   24000000			7.5
   26000000			7.5
   28000000			7.5
   30000000			7.5
   32000000			7.5
   36000000			7.5
   40000000			7.5
   44000000			7.5
   48000000			7.5
   52000000			7.5
   56000000			7.5
   60000000			7.5
   64000000			7.6
   68000000			7.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

What’s the specification of the GPU connection and what would be the expected bandwidth?

Hi @ptrblck ,

Sorry for the delay, it took me a while to find the right tool for the analysis. Running nvtop, I see that both GPUs on the A6000 machine are connected via PCIe gen3 x16, and the V100 machine has the same connection. So, based on the maximum bandwidth permitted (~16 GB/s), I would expect the A6000 to perform similarly to or better than the V100 machine in terms of bandwidth. Is my assessment correct?
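
For reference, a back-of-the-envelope upper bound for PCIe gen3 x16 (a sketch assuming 8 GT/s per lane and 128b/130b line encoding):

lanes = 16
raw_rate = 8e9                           # PCIe gen3: 8 GT/s per lane, 1 bit per transfer
payload = 128 / 130                      # 128b/130b line encoding overhead
print(f"{lanes * raw_rate * payload / 8 / 1e9:.2f} GB/s")  # ≈ 15.75 GB/s theoretical

Pageable transfers go through an intermediate staging buffer, so they typically land well below this ceiling.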

How can I do syncs? Should I add torch.cuda.synchronize() somewhere?

Yes, if you are using host timers you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timers.

Thanks, so it should be like below?

    torch.cuda.synchronize(device)   # wait for pending work on cuda:1 before starting the timer
    start = time.time()
    tensor = torch.tensor(image).to(device)
    torch.cuda.synchronize(device)   # wait for the copy to finish before stopping the timer
    times.append(time.time() - start)
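
For completeness, a minimal sketch of the full synchronized benchmark (assuming the same sample_image.jpg and cuda:1 as above) that also reports the effective bandwidth for comparison against bandwidthTest:

import torch
import time
import cv2
import numpy as np
from tqdm import tqdm

image = cv2.imread("sample_image.jpg").astype(np.float64)
device = "cuda:1"
times = []
for i in tqdm(range(1000)):
    image = image + np.random.rand()
    torch.cuda.synchronize(device)            # drain pending GPU work before timing
    start = time.time()
    tensor = torch.tensor(image).to(device)   # host tensor construction + H2D copy
    torch.cuda.synchronize(device)            # wait for the copy to finish
    times.append(time.time() - start)

copied_bytes = image.nbytes * (len(times) - 1)      # skip the first (warm-up) iteration
print(f"Mean copy time = {np.mean(times[1:]):.6f} s")
print(f"Effective bandwidth ≈ {copied_bytes / np.sum(times[1:]) / 1e9:.2f} GB/s")

Note that torch.tensor(image) itself makes a host-side copy before the transfer, so the effective number will come out a bit lower than the pure host-to-device bandwidth reported by bandwidthTest.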