Minimizing time moving tensors between CPU and GPU

I have a very large buffer of images (~1 million). They are all different. I can’t fit all of the images on the GPU, so they’re all on the CPU originally. Then, I sample from the buffer, do some things with the sampled images, and repeat. Something like (pseudocode):

import torch

buffer = torch.ones(10**6, 32, 32, 3)  # 1M images, each 32x32x3, stored on the CPU
for step in range(10**8):
    idx = torch.randint(0, buffer.size(0), (256,))  # sample 256 random images
    images = buffer[idx].to('cuda')                 # copy the sampled batch to the GPU
    update_model_with_images(images)

There’s no way to predict which images will be sampled, so I can’t pre-fetch them. When I profile my code using torch.utils.bottleneck, I find that almost all of the runtime is spent on moving tensors from CPU to GPU. This is frustrating. How can I increase performance without fitting the entire buffer on my GPU?

If the workload in your model is small, you would be limited by the memory bandwidth between the host and the device.
I would recommend manually profiling the code to check whether the data transfer is really slower than the model execution, e.g. via:

import time

nb_iters = 100
batch_size = 256

# time the host-to-device transfer alone
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    idx = torch.randint(0, buffer.size(0), (batch_size,))  # random sample, as in the training loop
    images = buffer[idx].to('cuda')
torch.cuda.synchronize()
t1 = time.time()
print('transfer: {:.6f}s per iteration'.format((t1 - t0) / nb_iters))

and the same for the model.
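For the model side, a minimal sketch along the same lines (assuming `update_model_with_images` from the pseudocode above is your training step and the batch is already resident on the GPU):

# time the model update alone, with the batch already on the GPU
images_gpu = buffer[:batch_size].to('cuda')

torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    update_model_with_images(images_gpu)  # placeholder for the actual training step
torch.cuda.synchronize()
t1 = time.time()
print('model: {:.6f}s per iteration'.format((t1 - t0) / nb_iters))

Comparing the two numbers shows whether the transfer or the model execution dominates the runtime.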

Sorry to bother you, but I have a very strange issue when using PyTorch with CUDA.

The code predicts the depth information of multiple RGB images in a for loop. During testing, in each loop, the network 1. predicts the depth information and then 2. transfers the tensor to the CPU. However, either the first step or the second step costs a lot of time (almost 0.5 seconds), but they never take a long time at the same time. More strangely, if I comment out the transfer code, some other step takes a rather long time to compute (the bottleneck seems to change from run to run).

It doesn't seem to be a problem with my code but might be an issue caused by PyTorch. I have tried torch.cuda.synchronize(), which didn't work. Do you have any idea about this weird situation?

Based on your description it sounds as if you are indeed not properly synchronizing the code while trying to profile it and are thus "moving" the measured time around. To profile individual operations you need to synchronize the code before starting and stopping each timer. Otherwise your timers would only capture the asynchronous kernel launch, and the entire runtime would accumulate in the next blocking call, e.g. the transfer to the CPU.
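As an illustration only (your code isn't shown, so `model` and `image` here are placeholder names), per-step timing with proper synchronization would look something like:

# synchronize before starting and after stopping each timer,
# otherwise only the asynchronous kernel launch is measured
torch.cuda.synchronize()
t0 = time.time()

depth = model(image)      # 1. predict the depth information
torch.cuda.synchronize()  # wait for the forward pass to finish
t1 = time.time()

depth_cpu = depth.cpu()   # 2. transfer the result to the CPU (blocks until the copy is done)
t2 = time.time()

print('forward:  {:.6f}s'.format(t1 - t0))
print('transfer: {:.6f}s'.format(t2 - t1))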
