Model inference on multiple processes and multiple GPUs

Hello, I have a Dockerized endpoint, set up with Flask + Gunicorn, that receives images containing text and runs multiple models to return a response containing that text.

The pipeline consists of YOLOv5 for object detection, DeepLabV3 for segmentation, an SSD model for detecting text fields, and an EasyOCR recognition model for the final text-recognition step.

We use multiple workers (each worker instantiates its own objects for these models) and 2 Tesla K80 GPUs. With this setup we get GPU out-of-memory errors, yet when I tried a single P100 GPU (which has less GPU memory than the current setup) they never happened. So my question is: how do we set up the model pipeline correctly to avoid these errors?

I would also like to ask about best practices when using multiple models like this across multiple workers/processes and multiple GPUs. Are there any pitfalls I should watch out for?

Is there also a way to measure how much GPU memory each model uses?

Hey @amroghoneim, have you seen/tried TorchServe for your application? It has support for multi-model workflows, load balancing (IIRC), and neat profiling views to see how much memory and compute each worker requires.
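For the memory question specifically, even without TorchServe you can read the CUDA caching-allocator counters before and after loading (and running) each model and print the difference. Rough sketch only — load_detector / load_segmenter are just placeholders for however you construct your models:

import torch

def gpu_mem(tag, device="cuda:0"):
    # memory_allocated = live tensors; memory_reserved = what the caching allocator
    # holds from the driver (closer to what nvidia-smi shows for the process)
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f"{tag}: {alloc:.1f} MiB allocated / {reserved:.1f} MiB reserved")

gpu_mem("baseline")
detector = load_detector().to("cuda:0")    # placeholder for your detector setup
gpu_mem("after detector")
segmenter = load_segmenter().to("cuda:0")  # placeholder for your segmenter setup
gpu_mem("after segmenter")

torch.cuda.max_memory_allocated() is also worth logging, since the peak during a forward pass (weights plus activations) is usually quite a bit larger than the weights alone.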


I definitely have not seen this before. Looks like what we need, thank you for that!

However, this seems like more of a long-term change given our current setup and priorities. In the meantime, what do you recommend so that I can utilize the GPUs I have without running into these issues?

Hard to say without knowing the specifics of your application. Assuming you’re using torch.mp, each spawned process has overhead that consumes memory. What happens if you reduce the number of workers? Also dropping this here in case you haven’t seen it: Multiprocessing best practices — PyTorch master documentation

Thanks, I’ll check this out. Actually, I’m not using torch.mp; the Gunicorn workers create their own model objects independently:

object_detector = ObjectDetector()
segmenter = Segmenter()

etc. per worker. Do you think this might create any issues?
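In case it’s relevant, one thing I could try is pinning each worker’s models to a single GPU so both K80s get used instead of everything landing on cuda:0. This is only a rough sketch — the PID-based round-robin and the device argument on our wrapper classes are guesses on my side, not how the code currently works:

import os
import torch

# crude round-robin: spread Gunicorn workers across the available GPUs
num_gpus = torch.cuda.device_count()
device = torch.device(f"cuda:{os.getpid() % num_gpus}")

# assumes our wrappers accept a device and move their weights there
object_detector = ObjectDetector(device=device)
segmenter = Segmenter(device=device)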

I’m not sure. Came across this answer on another thread, maybe it helps clarify things:

My understanding is that CUDA needs thread-safe multiprocessing, which is why torch has its own implementation. When we set up Gunicorn to manage the workers, this may be causing some conflict or thread-safety issues.
google cloud platform - Running PyTorch multiprocessing in a Docker container with Gunicorn worker manager - Stack Overflow

Looks like removing the --preload flag or calling torch.set_num_threads(1) seems to work?
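For reference, in config-file form that would look roughly like this (file name, worker count, and timeout are just illustrative):

# gunicorn.conf.py
workers = 2
timeout = 120
preload_app = False   # same effect as dropping --preload: each worker loads its own models after the fork

def post_fork(server, worker):
    # Gunicorn server hook, runs in each worker process after forking
    import torch
    torch.set_num_threads(1)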