I am new to PyTorch, and I have recently been playing around with TorchServe to try to serve several convolutional models. TorchServe runs in a Docker container with the `gpu` tag (the host system has a single RTX 3080). I have installed the proper drivers and the NVIDIA Container Toolkit, and I have built a local image with cu111 as described here.
I want the models to run predictions on the GPU. As far as I understand, the only requirement for this is that both the model and the input tensor are loaded into GPU memory. I have verified that both are indeed on the GPU using
`x.is_cuda` (for the input tensor) and `next(self.model.parameters()).is_cuda` (for the model). The nvidia-smi tool also showed that GPU memory was occupied by the torchserve process.
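For reference, the check described above can be sketched as a small helper (a simplified, self-contained example; the `check_cuda_placement` name and the tiny Conv2d model are placeholders, not my actual handler code):

```python
import torch

def check_cuda_placement(model: torch.nn.Module, x: torch.Tensor) -> bool:
    """Return True only if the input tensor and every model parameter live on the GPU."""
    model_on_gpu = all(p.is_cuda for p in model.parameters())
    return x.is_cuda and model_on_gpu

# Tiny stand-in model; falls back to CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Conv2d(3, 8, kernel_size=3).to(device)
x = torch.randn(1, 3, 32, 32, device=device)
print(check_cuda_placement(model, x))  # True on a GPU machine, False on CPU-only
```

In my handler I run essentially this check right before calling the model, and it reports that both the input and the parameters are on CUDA.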
However, when querying TorchServe for inference, the prediction appears to run entirely on the CPU: CPU load is close to 100%, while the GPU stays idle at 0% to 2% utilization.
Could anyone please give me a clue about what I might be missing, and what options I could try to verify that I am doing everything correctly to run predictions on the GPU?
Thank you for your time; I would appreciate any clues!
P.S. The TorchServe logs show that the GPU was seen and recognized, with no warnings about an incompatible CUDA version, etc.