I am trying to deploy a Pytorch image classification model wrapped in Flask
on g4dn.xlarge
(4 vCPU, 16GB RAM, T4 GPU with 16GB Memory) instances on AWS.
For selecting the optimal number of workers I performed some experiments:
Note:
* Only `model.forward` part runs on GPU - the rest of the steps run on CPU in the application.
* I have added a timing logger for every step of the application for checking
Experiment 1:
gunicorn main.app:app -b 0.0.0.0:8000 --workers 1
Concurrent Requests: 1
Total Time To Process 15 Requests By A Client: 15.87s (model.forward
takes 14.98s)
Experiment 2:
gunicorn main.app:app -b 0.0.0.0:8000 --workers 2
Concurrent Requests: 2 (2 clients sending requests in parallel)
Total Time To Process 15 Requests By A Client: 29.35s (model.forward
takes 28.34s, 2x of a single request, every other step taking a similar time)
Experiment 3:
gunicorn main.app:app -b 0.0.0.0:8000 --workers 3
Concurrent Requests: 3 (3 clients sending requests in parallel)
Total Time To Process 15 Requests By A Client: 43.82s (model.forward
takes 41.81s, 3x of a single request, every other step taking a similar time).
Using 3x workers is enabling me to process 3 requests in parallel but the overall processing time of all those requests is also becoming 3x - hence no improvement in real terms.
I initially thought CPU or IO processes are the bottlenecks in the app - but upon intensively logging the time taken at each step, I found the bottleneck is coming from the GPU processing (model.forward
starts taking 2x-3x times).
Upon checking the process ids of the workers for each request - I can also confirm that all the workers are getting the requests in parallel - but those are not able to perform the GPU processings in parallel at the same time.
Any guidance on what can be the bottleneck here will be very helpful.
Also - is there a recommended worker type to be used for such kinds of processing which are GPU-dependent?