How to get fast inference with PyTorch and MXNet models using a GPU?

Hello PyTorchers.
I'm using the word PyTorcher to mean someone who uses PyTorch.
I'm not sure whether this expression is appropriate, so please let me know what you think XD


For my task (face detection), I am using two deep learning frameworks, PyTorch and MXNet.

Please take a look at the diagram below to understand the overall pipeline.

When the code was first implemented, the GPU was used for the whole process above, but the inference time was not as fast as I expected.

To compare inference times, I later created a CPU-only version of the code (with no CUDA libraries installed).

I found that there was almost no difference between GPU and CPU processing in terms of inference time. (I measured FPS, but there was no dramatic difference.)

  1. Could you give me some ideas on how to use the GPU efficiently for fast inference?
  2. Is it normal that the Volatile GPU-Util is almost 0% while the PyTorch model is running inference?
  3. I capture the RTSP frames using the VideoStream class from the imutils package, which uses cv2 and threading. Do you have any ideas for fast, real-time (< 2 s delay) visualization? A simplified sketch of my capture loop is below.
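For reference, this is roughly the kind of capture loop I mean (just a simplified sketch: the RTSP URL is omitted and detect_faces is a stub standing in for the full MXNet + PyTorch pipeline, not my actual code):

```python
from imutils.video import VideoStream
import cv2

def detect_faces(frame):
    # stub standing in for the full MXNet + PyTorch detection pipeline;
    # it should return a list of (x1, y1, x2, y2) face boxes
    return []

# threaded RTSP capture: imutils wraps cv2.VideoCapture in a background thread
vs = VideoStream(src="rtsp://<camera-url>").start()  # URL omitted on purpose

while True:
    frame = vs.read()  # latest frame grabbed by the background thread
    if frame is None:
        continue
    for (x1, y1, x2, y2) in detect_faces(frame):
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

vs.stop()
cv2.destroyAllWindows()
```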

I will definitely respond to all of your comments.

Thank you so much in advance.

How did you implement the communication between MXNet and PyTorch?

The GPU utilization might be low if your workload is small or if your application has a bottleneck in another part of the code.
How large is your input and how fast is the data loading (including MXNet processing) compared to the forward pass of your PyTorch model?


Thank you @ptrblck.

To answer your questions:

  1. I just get numpy.array features from MXNet1 and MXNet2, as shown in the following figure.


    I put all the models on the same GPU (GPU:0).

  2. The input size is not that big. Using a cv2-based library, I grab frames over RTSP, resize each frame to (270, 470, 3), and feed it to the model.

  3. I need to measure the processing time at each step. Is it recommended to use time.time() before and after each code block (something like the snippet below)?
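Just to make the question concrete, this is the kind of naive timing I had in mind (all model functions here are stubs, not my real code):

```python
import time
import numpy as np

frame = np.zeros((270, 470, 3), dtype=np.uint8)  # stand-in for a resized RTSP frame

def mxnet_stage_1(x):   # stub for the first MXNet model
    return x

def mxnet_stage_2(x):   # stub for the second MXNet model
    return x

def pytorch_stage(x):   # stub for the PyTorch forward pass
    return x

t0 = time.time()
feat1 = mxnet_stage_1(frame)
t1 = time.time()
feat2 = mxnet_stage_2(feat1)
t2 = time.time()
pred = pytorch_stage(feat2)
t3 = time.time()

print(f"mxnet1 {t1 - t0:.4f}s | mxnet2 {t2 - t1:.4f}s | pytorch {t3 - t2:.4f}s")
```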

Thanks for the information.

  1. In that case you are adding synchronizations, as numpy uses CPU arrays. Your workflow would therefore probably be:
load data on CPU -> transfer to GPU and use MXNet model -> transfer back to CPU -> transform to PyTorch tensor and transfer back to GPU -> use forward pass of PyTorch model -> prediction

which most likely won’t benefit a lot from the GPU. You could try to use both models on the CPU only and compare the processing time. Since you are synchronizing and also transferring the data between the host and device multiple times, the GPU utilization could be low.

  2. Especially for small inputs, the workload on the GPU is small and you might see the overhead of the kernel launch times as well as the data transfer.

  3. Note that CUDA operations in PyTorch are asynchronous, so to profile the desired operation you would need to synchronize the code manually via torch.cuda.synchronize() before starting and stopping the timer.
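For example, a minimal sketch along these lines (the conv layer and the random input are just stand-ins for your model and frames) would give you valid GPU timings:

```python
import time
import torch

# stand-in for your actual PyTorch model, already on the GPU
model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda().eval()
x = torch.randn(1, 3, 270, 470, device="cuda")  # dummy input matching your resized frames

with torch.no_grad():
    # warm-up iterations so the CUDA startup overhead doesn't skew the timing
    for _ in range(10):
        model(x)

    torch.cuda.synchronize()   # wait for all pending GPU work before starting the timer
    t0 = time.perf_counter()
    for _ in range(100):
        out = model(x)
    torch.cuda.synchronize()   # wait for the forward passes to finish before stopping it
    t1 = time.perf_counter()

print(f"avg. forward time: {(t1 - t0) / 100 * 1000:.3f} ms")
```

The same pattern (synchronize, start the timer, run the block, synchronize, stop the timer) works for profiling each stage of your pipeline separately.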


Thank you so much @ptrblck.
Your comments are really helpful every time.

You made me want to contribute to the PyTorch forum.
I will do my best with PyTorch and share my knowledge with PyTorchers.

From South Korea

Thanks.
