I’m currently working on a pipeline that runs an OpenCV operation between two GPU computations, and I’ve run into two problems:

1. The OpenCV code runs on the CPU, so the data must be transferred from the GPU to the CPU and back to the GPU again.
2. The OpenCV function can only process one sample (of a batch) at a time, which slows down the computation.
Here is an example of my code:
```python
import cv2
import torch

# GPU computation
x = ...  # tensor of shape (B, C, H, W) on the GPU

# Pre-allocate memory for the results
outputs = torch.zeros(some_shape, device="cuda")

# Send each sample in the batch to the CPU
for b in range(x.size(0)):
    curr_x = x[b].cpu().numpy()
    # Use a special sequential algorithm, so no GPU alternatives
    output = cv2.function(...)
    # Send the sample back to the GPU
    outputs[b] = torch.from_numpy(output).cuda()

# Other GPU computations
```
The code runs slowly with low CPU and GPU usage.
I would be grateful for any suggestions or insights you may have.
Thank you very much!
I don’t know which function you are calling from OpenCV, but you might want to check whether OpenCV provides a CUDA version of it, or whether another library with GPU support could be used instead (e.g. torchvision, using native PyTorch operations or custom kernels).
I haven’t profiled these implementations and don’t know which inputs you are using, but note that even if OpenCV is faster in isolation on the CPU, you will still pay the penalty of moving the data between the device and the host, which also synchronizes the code.
Assuming Kornia provides a GPU implementation, no syncs would be added and the end-to-end time might still be lower. But as I said, I did not profile any of these methods.
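To see where the time actually goes, a minimal stdlib timing helper can be wrapped around each stage. This is just a sketch on my side: the `sync` argument is meant to take something like `torch.cuda.synchronize`, since CUDA calls are asynchronous and the clock would otherwise be read before the GPU work has finished.

```python
import time

def timed(fn, *args, sync=None, **kwargs):
    """Time one call to fn. `sync` (e.g. torch.cuda.synchronize) flushes
    pending asynchronous GPU work before each clock read; leave it as None
    for plain CPU code."""
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    return result, time.perf_counter() - start

# Usage with a plain CPU function (no GPU needed for the helper itself):
value, elapsed = timed(sum, range(1000))
print(value, elapsed >= 0.0)  # → 499500 True
```

Timing the transfer, the OpenCV call, and the GPU sections separately should show whether the host/device copies or the per-sample loop dominate.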
Yes, I actually profiled the runtime of kornia.geometry.ransac (which is equivalent to cv2.findHomography(..., cv2.RANSAC)). Unfortunately, even with the overhead of the data transfers, the OpenCV version still seems to be about 10 times faster than the Kornia implementation.
Thanks for bringing up this function. Yes, I’ve also tested this one. The computation is a lot faster, but the lack of robust estimation (like RANSAC) leads to worse results in my case.
Lastly, thanks for this suggestion! Unfortunately, as you pointed out, my next operation actually needs the transferred tensor.
I was thinking about:

- Whether sending the entire x batch to the CPU in one transfer leads to any speedup over transferring each sample separately. (The speedup will probably be marginal, though.)
- Whether it is possible to run the OpenCV function on all samples in the batch concurrently with multiprocessing.
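For the second idea, a thread pool might be simpler than multiprocessing: most OpenCV routines release the GIL while executing their C++ code, so threads can overlap the per-sample calls without the pickling overhead of sending arrays to worker processes. A sketch combining both ideas (one batched transfer, then concurrent per-sample calls); `cpu_function` is a hypothetical stand-in for the actual `cv2.function` call:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def cpu_function(sample: np.ndarray) -> np.ndarray:
    # Placeholder for the sequential OpenCV call (cv2.function in the post).
    return sample * 2.0  # stand-in computation

def process_batch(x_cpu: np.ndarray, max_workers: int = 4) -> np.ndarray:
    # x_cpu: (B, C, H, W) array already moved to the host in ONE transfer,
    # e.g. x.cpu().numpy() on the whole batch instead of per sample.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(cpu_function, x_cpu))
    return np.stack(results)

batch = np.ones((8, 3, 4, 4), dtype=np.float32)
out = process_batch(batch)
print(out.shape)  # → (8, 3, 4, 4)
```

Whether this actually helps depends on how much of the OpenCV call runs outside the GIL and how large the per-sample work is; for very short calls the pool overhead could dominate.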
But in general, I guess it is difficult to significantly improve the runtime without large changes to the pipeline or to the OpenCV/Kornia source code.