I had a PaddleOCR model and converted it to a PyTorch model. I'm trying to deploy it so that I get high throughput and low latency.
The bottom line of what I have is ModelA → compute on CPU → ModelB, and I want to deploy it in the most optimal manner.
I know AWS Neuron can trace a single model graph and "split" it so that its partitions run on different NeuronCores.
This got me wondering whether I could hack ModelA, the CPU compute, and ModelB together into one graph. Does this sound like a bad idea?
Also, if you had to deploy something like this, how would you do it? (Just to constrain the problem, let's assume it all has to run in one container.)
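For context, the alternative I've been weighing instead of fusing everything into one graph is keeping the three stages separate and pipelining them with queues, so ModelA, the CPU step, and ModelB each stay busy on their own worker (the two model workers would each hold a Neuron-compiled model pinned to its own core). Here's a rough sketch with dummy stages standing in for the real models; all the names and the toy computations are just placeholders:

```python
import queue
import threading

# Placeholder stages standing in for the real pipeline -- in the actual
# deployment, model_a and model_b would be Neuron-compiled callables and
# cpu_compute would be the preprocessing between them.
def model_a(x):
    return x + 1

def cpu_compute(x):
    return x * 2

def model_b(x):
    return x - 3

SENTINEL = object()  # signals end-of-stream through the pipeline

def stage(fn, inq, outq):
    """Pull items from inq, apply fn, push results to outq; forward the sentinel."""
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)
            break
        outq.put(fn(item))

def run_pipeline(items):
    # One queue between each pair of stages, one thread per stage.
    q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
    stages = [(model_a, q0, q1), (cpu_compute, q1, q2), (model_b, q2, q3)]
    threads = [threading.Thread(target=stage, args=s) for s in stages]
    for t in threads:
        t.start()
    for x in items:
        q0.put(x)
    q0.put(SENTINEL)
    results = []
    while True:
        out = q3.get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([1, 2, 3]))  # [1, 3, 5]
```

The point of the queues is that once the pipeline is warm, ModelA can work on request N+2 while the CPU step handles N+1 and ModelB handles N, which is basically the throughput story I'm after; latency per request stays roughly the sum of the three stages.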