Model inference on multiple processes and multiple GPUs

Hello, I have a Dockerized endpoint, set up with Flask + Gunicorn, that receives images containing text and runs multiple models to return a response containing that text.

The pipeline consists of YOLOv5 for object detection, DeepLabV3 for segmentation, an SSD model for detecting text fields, and an EasyOCR recognition model for the final text-recognition step.

We use multiple workers (each worker instantiates its own objects for these models) and 2 Tesla K80 GPUs. With this setup we get GPU out-of-memory errors, yet when I tried a single P100 GPU (which has less GPU memory than the current setup) they never happened. So my question is: how do we set up the model pipeline correctly to avoid these errors?

I would also like to ask about best practices when using multiple models like this across multiple workers/processes and multiple GPUs. Are there any pitfalls I should watch out for?

Is there also a way to measure how much GPU memory each model uses?

Hey @amroghoneim, have you seen/tried TorchServe for your application? It has support for multi-model workflows, load balancing (IIRC), and neat profiling views to see how much memory and compute each worker requires.
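For the memory question specifically, even without TorchServe you can read the CUDA caching-allocator counters before and after loading (and running) each model and print the difference. Rough sketch only — load_detector / load_segmenter are just placeholders for however you construct your models:

import torch

def gpu_mem(tag, device="cuda:0"):
    # memory_allocated = live tensors; memory_reserved = what the caching allocator
    # holds from the driver (closer to what nvidia-smi shows for the process)
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f"{tag}: {alloc:.1f} MiB allocated / {reserved:.1f} MiB reserved")

gpu_mem("baseline")
detector = load_detector().to("cuda:0")    # placeholder for your detector setup
gpu_mem("after detector")
segmenter = load_segmenter().to("cuda:0")  # placeholder for your segmenter setup
gpu_mem("after segmenter")

torch.cuda.max_memory_allocated() is also worth logging, since the peak during a forward pass (weights plus activations) is usually quite a bit larger than the weights alone.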


I definitely have not seen this before. Looks like what we need, thank you for that!

However, this seems like more of a long-term change given our current setup and priorities. In the meantime, what do you recommend so that I can utilize the GPUs I have without running into these issues?

Hard to say without knowing the specifics of your application. Assuming you’re using torch.mp, each spawned process has overhead that consumes memory. What happens if you reduce the number of workers? Also dropping this here in case you haven’t seen it: Multiprocessing best practices — PyTorch master documentation

Thanks, I’ll check this out. Actually, I’m not using torch.mp; the Gunicorn workers create their own model objects independently:

object_detector = ObjectDetector()
segmenter = Segmenter()

etc. per worker. Do you think this might create any issues?
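In case it’s relevant, one thing I could try is pinning each worker’s models to a single GPU so both K80s get used instead of everything landing on cuda:0. This is only a rough sketch — the PID-based round-robin and the device argument on our wrapper classes are guesses on my side, not how the code currently works:

import os
import torch

# crude round-robin: spread Gunicorn workers across the available GPUs
num_gpus = torch.cuda.device_count()
device = torch.device(f"cuda:{os.getpid() % num_gpus}")

# assumes our wrappers accept a device and move their weights there
object_detector = ObjectDetector(device=device)
segmenter = Segmenter(device=device)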

I’m not sure. Came across this answer on another thread, maybe it helps clarify things:

My understanding is that CUDA needs thread-safe multiprocessing, which is why torch has its own implementation. When we set up Gunicorn to manage the workers, this may be causing some conflict or thread-safety issues.
google cloud platform - Running PyTorch multiprocessing in a Docker container with Gunicorn worker manager - Stack Overflow

Looks like removing the --preload flag or calling torch.set_num_threads(1) seems to work?
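For reference, in config-file form that would look roughly like this (file name, worker count, and timeout are just illustrative):

# gunicorn.conf.py
workers = 2
timeout = 120
preload_app = False   # same effect as dropping --preload: each worker loads its own models after the fork

def post_fork(server, worker):
    # Gunicorn server hook, runs in each worker process after forking
    import torch
    torch.set_num_threads(1)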