Model inference on multiple processes and multiple GPUs

Hello, I have a Dockerized endpoint, set up with Flask + Gunicorn, that receives images containing text and runs multiple models to return a response containing that text.

The pipeline consists of YOLOv5 for object detection, DeepLabv3 for segmentation, an SSD model for detecting text fields, and an EasyOCR recognition model for the final text recognition step.

We use multiple workers (each worker instantiates its own copies of these models) and 2 Tesla K80 GPUs. With this setup we get GPU out-of-memory errors, yet when I tried a single P100 GPU (which has less total GPU memory than the current setup) the errors never happened. So my question is: how do we set up the model pipeline correctly to avoid these errors?
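To make the question concrete: right now each worker just loads everything onto the default device. A minimal sketch of what I think we should be doing instead (assuming PyTorch; `pick_device` and the round-robin-by-PID scheme are my own guesses, not something from our current code) would pin each Gunicorn worker to one GPU:

```python
import os
import torch

def pick_device():
    """Assign this worker process to one GPU, round-robin by PID.

    PIDs of Gunicorn workers are usually consecutive, so this spreads
    workers roughly evenly across GPUs; a worker-ID-based scheme would
    be more deterministic if Gunicorn exposes one.
    """
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    return torch.device(f"cuda:{os.getpid() % n_gpus}")

device = pick_device()
# Then load every model in the pipeline once per worker, onto that device:
# yolo = torch.hub.load("ultralytics/yolov5", "yolov5s").to(device).eval()
# ...same for DeepLabv3, SSD, and EasyOCR (via its `gpu`/device options)
```

Is pinning each worker to a single GPU like this the right approach, or should the four models within one worker be split across GPUs instead?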

I would also like to ask about best practices when running multiple models like this across multiple workers/processes and multiple GPUs. Are there any pitfalls I should watch out for?
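For example, one option I came across (an assumption on my part that it applies here, not something we use yet) is capping each worker's share of a device so one process can't starve the others; PyTorch 1.8+ has `torch.cuda.set_per_process_memory_fraction` for this:

```python
import torch

def cap_worker_memory(fraction, device=0):
    """Limit this process's CUDA caching allocator to a fraction of the GPU.

    E.g. with 4 workers sharing one K80, fraction=0.25 would keep any
    single worker from exhausting the device. No-op on CPU-only machines.
    """
    if torch.cuda.is_available():
        torch.cuda.set_per_process_memory_fraction(fraction, device)

cap_worker_memory(0.25)
```

Would something like this be considered good practice, or is it better to control memory at the model/batch level?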

Is there also a way to measure how much GPU memory each model uses?
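The best I've come up with so far is diffing `torch.cuda.memory_allocated` around each model load (a rough sketch; `load_fn` is a hypothetical zero-arg loader you'd substitute for each of the YOLOv5/DeepLabv3/SSD/EasyOCR loaders), plus watching `nvidia-smi` for the process-level totals:

```python
import torch

def model_gpu_memory(load_fn, device="cuda:0"):
    """Return (model, bytes) where bytes is the GPU memory the load added.

    Only counts memory allocated through PyTorch's caching allocator;
    CUDA context overhead and non-PyTorch allocations won't show up here.
    """
    torch.cuda.empty_cache()
    before = torch.cuda.memory_allocated(device)
    model = load_fn()
    after = torch.cuda.memory_allocated(device)
    return model, after - before

# Example (hypothetical stand-in for one of the real models):
# model, used = model_gpu_memory(lambda: torch.nn.Linear(1000, 1000).to("cuda:0"))
```

Is this a reliable way to measure it, or is there a better tool for per-model accounting?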