Can anyone recommend an efficient way to reduce inference time? Currently, my model takes 50 sec to run inference.
Objective: I have built a Flask API and it takes 50 sec to run inference. I would like to reduce the inference time and return the result from the Flask app.
This is for real-time inference: I want to pass a single image to the Flask API and get the prediction back.
Instance: GPU p2.xlarge
Steps already taken:
- moved the model to the GPU
Application: semantic segmentation model
It would be really helpful if anyone could recommend an efficient way to optimize this.
@ptrblck could you please help me.
I think it would be interesting to see how much time each part of your full inference pipeline takes. E.g. is the model itself using the majority of the 50 s, or is it the Flask application?
Based on this, you could then check how to improve your use case, e.g. by speeding up the model itself via the performance guide or by switching to Triton or another inference serving application.
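To make the per-stage breakdown concrete, here is a minimal sketch of timing each part of a request separately. The names `timed`, `timings` and `handle_request` are hypothetical helpers, not part of Flask or PyTorch, and the stages are stand-ins for the real preprocessing, model call and JSON conversion. One assumption to note: CUDA work is queued asynchronously, so on a GPU you would pass something like `torch.cuda.synchronize` as `sync`, otherwise the model stage can look faster than it is.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: accumulated wall-clock seconds per pipeline stage.
timings = {}


@contextmanager
def timed(stage, sync=None):
    """Add the wall-clock time spent inside the `with` block to timings[stage].

    `sync` is an optional zero-argument callable run before each clock read.
    Assumption: for a PyTorch model on a GPU this would be
    torch.cuda.synchronize, so that queued GPU work is finished before
    the timer starts and stops. torch is not imported in this sketch.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


def handle_request(payload, preprocess, infer, serialize, sync=None):
    """Run one request and time each stage on its own."""
    with timed("preprocess"):
        x = preprocess(payload)
    with timed("model", sync=sync):
        y = infer(x)
    with timed("serialize"):
        body = serialize(y)
    return body


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without a model or a GPU.
    handle_request(
        [1, 2, 3],
        preprocess=lambda p: [v / 255 for v in p],
        infer=lambda x: [v > 0.005 for v in x],
        serialize=lambda y: str(y),
    )
    for stage, seconds in timings.items():
        print(f"{stage}: {seconds:.6f} s")
```

Inside a Flask view you would call `handle_request` with the real functions and log `timings` once per request; the stage that dominates is the one worth optimizing first.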
First of all, thanks for your reply @ptrblck. I checked and found that model inference takes about 30 sec, while Flask takes about 20 sec to convert the ndarray to a list, run json.dumps, and return the result from the application.
I have now managed to reduce that 20 sec to around 5 sec for the ndarray-to-list conversion and json.dumps.
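The post does not say how the conversion time was reduced. As one possible direction for the remaining serialization cost, here is a sketch that sends a segmentation mask as compressed bytes inside the JSON body instead of as a nested list of numbers. `encode_mask` and `decode_mask` are hypothetical helpers. It assumes the mask holds class indices that fit in uint8 (at most 256 classes) and that the client is able to decode base64 plus zlib. With NumPy, the raw bytes would come from something like `mask.astype(np.uint8).tobytes()`; plain `bytes` are used here so the example runs with the standard library only.

```python
import base64
import json
import zlib


def encode_mask(mask_bytes, height, width):
    """Pack a row-major uint8 class-index mask into a small JSON-safe dict."""
    if len(mask_bytes) != height * width:
        raise ValueError("mask size does not match height * width")
    # Compress the raw bytes, then base64-encode so they fit in a JSON string.
    packed = base64.b64encode(zlib.compress(mask_bytes)).decode("ascii")
    return {"height": height, "width": width, "dtype": "uint8", "data": packed}


def decode_mask(payload):
    """Client side: recover the raw mask bytes from the dict."""
    return zlib.decompress(base64.b64decode(payload["data"]))


if __name__ == "__main__":
    height, width = 512, 512
    half = height * width // 2
    # Synthetic mask (assumption): top half class 0, bottom half class 1.
    mask = bytes([0]) * half + bytes([1]) * half

    compact = json.dumps(encode_mask(mask, height, width))
    # Baseline: the nested-list form that ndarray.tolist() would produce.
    rows = [list(mask[r * width:(r + 1) * width]) for r in range(height)]
    naive = json.dumps(rows)

    print("nested-list JSON chars:", len(naive))
    print("compressed JSON chars: ", len(compact))
    assert decode_mask(json.loads(compact)) == mask
```

Real masks will compress less well than this synthetic one, so it is worth measuring on actual model outputs before committing to the format.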
What I’m really trying to reduce now is the model inference time itself.