Problem running inference of models in Docker on GPU


I am new to PyTorch, and I have recently been playing around with TorchServe to try to serve several convolutional models. TorchServe runs in a Docker container with the proper tag ('gpu' - the host system has a single RTX 3080). I have installed the proper drivers and the NVIDIA Container Toolkit, and I have also built a local image with cu111 as described here.

I wanted the models to run predictions on the GPU. As far as I understand, the only requirement is that both the model and the input tensor are loaded into GPU memory. I have verified that this is the case using x.is_cuda (for the input tensor) and next(self.model.parameters()).is_cuda (for the model). The nvidia-smi tool also showed that GPU memory was occupied by the torchserve process.
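For reference, a minimal standalone sketch of those placement checks outside of TorchServe (the Conv2d model here is a hypothetical stand-in for the actual served model; the code falls back to CPU when no GPU is visible):

```python
import torch

# Pick the GPU if one is visible to this process, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the real model; in a TorchServe handler the .to(device)
# call would typically happen in initialize().
model = torch.nn.Conv2d(3, 8, kernel_size=3).to(device).eval()

# Stand-in for the preprocessed input; it must be moved to the same device.
x = torch.randn(1, 3, 32, 32).to(device)

# The two checks from the post; both print True on a GPU machine.
print(next(model.parameters()).is_cuda, x.is_cuda)

with torch.no_grad():
    y = model(x)

# The output lands on the same device the model and input live on.
print(y.device)
```

If both checks pass, the forward pass itself should run on the GPU; a common gap is an input tensor that is created on the CPU in preprocess() and never moved.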

However, when I query TorchServe for inference, the prediction runs entirely on the CPU (the CPU load is close to 100%, while the GPU sits idle at 0-2%).

Could anyone please give me a clue about what I might be missing, and what I can try in order to verify that everything is set up correctly for running predictions on the GPU?

Thank you for your time - I would appreciate any clues!

P.S. The TorchServe logs showed that the GPU was detected and recognized, with no warnings about an incompatible CUDA version etc.

This could point towards a CPU bottleneck, e.g. in the data loading.
Try to profile the code to see which operations are the actual bottleneck and to verify that the GPU is indeed used.
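To make that concrete, here is one way to profile a single forward pass with torch.profiler (a sketch, again using a hypothetical stand-in model rather than the actual served one; the CUDA activity is only recorded when a GPU is available):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the served convolutional model.
model = torch.nn.Conv2d(3, 16, kernel_size=3).to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)

# Record CUDA kernel times only when a GPU is actually present.
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.no_grad():
        model(x)

# If the GPU is really doing the work, the conv ops should show
# CUDA time here; all-CPU time would confirm a CPU-side bottleneck.
table = prof.key_averages().table(row_limit=10)
print(table)
```

Inside a TorchServe handler you could wrap the inference() call the same way and log the table, which separates time spent in data loading/preprocessing from time spent in the actual kernels.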


Thanks for the hint! Had the same suspicion, will investigate…

Hi Timur, were you able to resolve this issue? A repro with your .mar file on some shared drive would make it easy for me to debug. As far as CPU bottlenecks go, are you working with large inputs or large batch sizes?