PyTorch inference is slow inside a Flask API

I'm running into a problem with inference speed: running inference inside a Flask API endpoint (which runs in another thread) is about 10 times slower than running the same inference in the main thread.
Do you have any idea what causes this, and any solution?
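To narrow the problem down outside of Flask, the comparison can be reproduced with a plain worker thread. This is a minimal sketch, assuming a small stand-in model (the real model and input shapes are not given in the thread):

```python
import threading
import time

import torch

# Hypothetical small model standing in for the real one.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()
x = torch.randn(32, 256)

def timed_inference(label, results):
    # Warm up once, then time a batch of forward passes.
    with torch.no_grad():
        model(x)
        start = time.perf_counter()
        for _ in range(50):
            model(x)
    results[label] = time.perf_counter() - start

results = {}
timed_inference("main thread", results)

worker = threading.Thread(target=timed_inference, args=("worker thread", results))
worker.start()
worker.join()

for label, elapsed in results.items():
    print(f"{label}: {elapsed:.3f}s")
```

If the worker-thread timing is much worse here too, Flask itself is not the culprit; if both are comparable, the slowdown comes from something in the Flask deployment (number of concurrent threads, server configuration, etc.).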


Do you know if Flask is perhaps reducing the priority of these worker threads?

Another possibility is that these worker threads don't use intra-op multithreading to process your model.
Or that so many of these threads use multithreading that you oversubscribe the CPU, which leads to a slowdown.

I have tested it with a single API endpoint running PyTorch model inference. The speed is still lower than in the main thread, but much less so (about 2 times slower). So yes, multiple running threads could drive up CPU usage. However, in the single-endpoint case I haven't figured out how to improve it.
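One common mitigation for the oversubscription case discussed above is to cap PyTorch's intra-op thread pool before serving requests. A minimal sketch; the value 1 is an assumption, and in practice you would tune it to roughly cores divided by the number of concurrent requests:

```python
import torch

# Limit PyTorch's intra-op thread pool so that several worker threads
# running inference at once don't oversubscribe the CPU cores.
# Must be called before the first inference; 1 is an illustrative value.
torch.set_num_threads(1)

print(torch.get_num_threads())
```

In a Flask app this would go at module import time, before the first request is handled.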

Do you run your PyTorch model in Python, or have you JITed it to TorchScript?
We discuss this a bit in Chapter 15 of our book: if you run your model through the JIT - even from Python - you avoid the infamous Python GIL for all intermediates, which may give you an edge in multithreaded deployment scenarios.
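The JIT route mentioned above can be sketched as follows, with a hypothetical small module standing in for the real model:

```python
import torch

class TinyModel(torch.nn.Module):
    # Hypothetical stand-in for the real model.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()

# Compile the model to TorchScript. The compiled forward runs outside
# the Python interpreter, so intermediates don't hold the GIL.
scripted = torch.jit.script(model)

x = torch.randn(2, 16)
with torch.no_grad():
    eager_out = model(x)
    scripted_out = scripted(x)

# The scripted model should reproduce the eager model's outputs.
print(torch.allclose(eager_out, scripted_out))
```

`torch.jit.trace` is an alternative when the model has no data-dependent control flow; the scripted/traced module can also be saved with `scripted.save(...)` and loaded in the serving process.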

Best regards



I run the PyTorch model in Python.

I would suspect that JITing the model would already help, then.

Thanks for your help. I have tried JITing the model and it solved the inference speed problem. I am currently verifying the outputs to check that there is no accuracy degradation.