Best practice for GPU -> CPU -> GPU computation

Hi everyone,

I’m currently working on a pipeline that involves using OpenCV between two GPU computations. However, I encountered two problems:

  1. OpenCV code runs on the CPU, which means that data must be transferred from the GPU to the CPU and back to the GPU again.

  2. The OpenCV function can only process one sample (from a batch) at a time, which slows down the computation.

Here is an example of my code:

# GPU computation
x = ...  # tensor of shape (B, C, H, W) on the GPU

# Pre-allocate memory on the GPU
outputs = torch.zeros(some_shape, device="cuda")

# Send each sample in the batch to the CPU
for b in range(B):
    curr_x = x[b].cpu().numpy()

    # Use a special sequential algorithm, so no GPU alternatives
    output = cv2.function(curr_x, ...)

    # Send the sample back to the GPU
    # (torch.from_numpy keeps the dtype and avoids an extra copy,
    # unlike torch.Tensor, which always creates a float32 copy)
    output = torch.from_numpy(output).cuda()
    outputs[b] = output

# Other GPU computations

The code runs slowly with low CPU and GPU usage.

I would be grateful for any suggestions or insights you may have.
Thank you very much!

I don’t know which function you are calling from OpenCV, but you might want to either check if OpenCV provides the CUDA version of this function or if you could use another library with GPU support (e.g. torchvision using native PyTorch operations or custom kernels).

Hi @ptrblck,

Thank you for your suggestion!

The function I’m using is homography matrix estimation with RANSAC:
cv2.findHomography(..., cv2.RANSAC).

However, OpenCV does not provide an off-the-shelf CUDA version of it, and the existing PyTorch implementation (e.g., in Kornia) is even slower than OpenCV.

Moreover, both implementations only accept a single sample as input, not a batch.

So I was wondering if it is possible to optimize the code from the perspective of data transfer or parallelization. Do you have any suggestions on this?

Thanks again for your help!

I haven’t profiled these implementations and don’t know what inputs you are using, but note that even if OpenCV is faster in isolation on the CPU, you will still pay the penalty of moving the data between the device and the host and thus synchronizing the code.
Assuming kornia provides a GPU implementation, no syncs would be added and the end-to-end time might still be faster. But as I said, I did not profile any of these methods.
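When comparing the two end-to-end, one pitfall is that CUDA kernels launch asynchronously, so naive wall-clock timing can under-count GPU work. A minimal timing sketch (the helper name `time_fn` and the matmul workload are just illustrative; it synchronizes before and after the timed loop so GPU time is fully counted, and falls back to the CPU if no GPU is present):

```python
import time

import torch


def time_fn(fn, *args, n_warmup=3, n_iters=10):
    """Average runtime of fn(*args), synchronizing CUDA so async kernels count."""
    for _ in range(n_warmup):
        fn(*args)  # warmup excludes one-time startup costs
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters


# Example workload: a matmul on whatever device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
avg_s = time_fn(torch.matmul, a, a)
print(f"avg time per call: {avg_s * 1e3:.3f} ms")
```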

kornia.geometry.homography.find_homography_dlt seems to expect batched inputs based on the docs. CC @edgarriba to correct me.

I’m unsure how you want to do it. You could try to move the data with non_blocking=True, but if the next operation is actually needing the transferred tensor you won’t be able to overlap the transfer.
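For reference, the `non_blocking=True` pattern needs a pinned (page-locked) host buffer for the device-to-host copy to actually be asynchronous, and a sync before the CPU reads the data. A minimal sketch of that pattern (shapes are arbitrary; it falls back to a plain CPU copy when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Source tensor on the GPU (or CPU as a fallback)
x = torch.randn(8, 3, 64, 64, device=device)

# Pinned host buffer: required for a truly asynchronous device-to-host copy
x_host = torch.empty(x.shape, dtype=x.dtype,
                     pin_memory=torch.cuda.is_available())

# Start the copy without blocking the current stream ...
x_host.copy_(x, non_blocking=True)

# ... independent GPU work could be enqueued here to overlap the transfer

# Before touching x_host on the CPU, synchronize so the copy has finished
if torch.cuda.is_available():
    torch.cuda.synchronize()

print(x_host.shape)
```

As noted above, this only helps if there is independent work to overlap; if the very next operation consumes the transferred tensor, the sync removes the benefit.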

Thank you for the prompt reply, @ptrblck!

Yes, I actually profiled the runtime of kornia.geometry.ransac (which is equivalent to cv2.findHomography(..., cv2.RANSAC)). Unfortunately, even with the overhead of data transfer, the OpenCV version still seems to be about 10 times faster than the Kornia implementation.

Thanks for bringing up this function. Yes, I’ve also tested this one. The computation is a lot faster, but the lack of robust estimation (like RANSAC) leads to worse results in my case.

Lastly, thanks for this suggestion! Unfortunately, as you pointed out, my next operation actually needs the transferred tensor.

I was thinking about:

  1. Whether sending the entire x batch to the CPU leads to any speedup over sending each sample in x separately. (The speedup will probably be marginal, though)

  2. Whether it is possible to run the OpenCV function on all samples in the batch concurrently with multi-processing.
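On idea 2: OpenCV's Python bindings generally release the GIL while the underlying C++ routine runs, so a thread pool (cheaper than multiprocessing, no pickling of arrays) can run several per-sample calls concurrently. A sketch of the pattern, with a placeholder least-squares fit standing in for the actual cv2.findHomography call (`estimate_homography` and the random point sets are illustrative only):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def estimate_homography(pts_pair):
    # Placeholder for cv2.findHomography(src, dst, cv2.RANSAC):
    # a plain least-squares fit stands in for the real call here.
    src, dst = pts_pair
    A = np.hstack([src, np.ones((src.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return coeffs


# One (src, dst) point set per sample in the batch
rng = np.random.default_rng(0)
batch = [(rng.random((20, 2)), rng.random((20, 2))) for _ in range(8)]

# Run the per-sample calls concurrently; pool.map keeps the batch order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(estimate_homography, batch))

print(len(results))  # one result per sample
```

Whether this helps depends on how long each call runs relative to the threading overhead, so it is worth profiling on real inputs.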

But in general, I guess it is difficult to significantly improve the runtime without large edits to the pipeline or to the OpenCV/Kornia source code.

Thank you again for your time and help:)

  1. Yes, I would expect to see a higher bandwidth if the transfer can be batched instead of calling separate copy kernels on each sample.

  2. I don’t know if that’s possible.
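A sketch of point 1 applied to the original loop: one device-to-host copy for the whole batch, per-sample CPU work on views into the resulting array, and one host-to-device copy for all results (`process_sample` is a placeholder for the OpenCV call, and the CPU fallback is only so the sketch runs anywhere):

```python
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"


def process_sample(sample_np):
    # Placeholder for the per-sample OpenCV call
    return sample_np * 2.0


B, C, H, W = 4, 3, 8, 8
x = torch.randn(B, C, H, W, device=device)

# One device-to-host copy for the whole batch instead of B small ones
x_cpu = x.cpu().numpy()

# Per-sample CPU work on views into the batched array (no extra copies)
results = np.stack([process_sample(x_cpu[b]) for b in range(B)])

# One host-to-device copy for all results
outputs = torch.from_numpy(results).to(device)

print(outputs.shape)
```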