Deploying multiple models that invoke each other

I had a PaddleOCR model and converted it to a PyTorch model. I'm trying to deploy it so that I get high throughput and low latency.

The bottom line of what I have is ModelA → compute on CPU → ModelB, and I want to deploy it in the most optimal way.

I know we can trace a single model graph with AWS Neuron and “split” it so that its partitions run on different NeuronCores.
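For reference, this is the kind of single-model split I mean (just a sketch, assuming an Inf1 instance with torch-neuron installed; the `detector` stand-in module, input shape, and core count are placeholders):

    import torch
    import torch_neuron  # registers the torch.neuron API (assumes an Inf1 / torch-neuron setup)

    # Stand-in for the detection model; in reality this would be the converted PaddleOCR detector.
    detector = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())
    example = torch.rand(1, 3, 640, 640)  # placeholder input shape

    # Trace one model and ask the compiler to pipeline its partitions across 4 NeuronCores.
    detector_neuron = torch.neuron.trace(
        detector,
        example_inputs=[example],
        compiler_args=["--neuroncore-pipeline-cores", "4"],
    )
    detector_neuron.save("detector_neuron.pt")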

That got me thinking: could I hack ModelA, the CPU compute, and ModelB together into one graph? Does this sound like a bad idea?
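By "one graph" I mean wrapping both models and the crop logic in a single `forward()` and tracing that. Something like this sketch (using an axis-aligned crop as a simplification; the real PaddleOCR crop does a perspective warp, and `crop_hw` is a placeholder size):

    import torch
    import torch.nn.functional as F

    class DetectCropRecognize(torch.nn.Module):
        # Hypothetical fused module: detector + crop loop + recognizer in one forward().
        def __init__(self, detector, recognizer, crop_hw=(32, 320)):
            super().__init__()
            self.detector = detector
            self.recognizer = recognizer
            self.crop_hw = crop_hw

        def forward(self, img):                      # img: (1, 3, H, W)
            dt_boxes = self.detector(img)            # assumed to return axis-aligned boxes (N, 4)
            crops = []
            for box in dt_boxes:                     # data-dependent loop: N varies per image,
                x1, y1, x2, y2 = box.int().tolist()  # which tracing would freeze to the example input
                crops.append(F.interpolate(img[:, :, y1:y2, x1:x2], size=self.crop_hw))
            batch = torch.cat(crops, dim=0)
            return self.recognizer(batch)

My worry is that the crop loop is data-dependent (the number and size of boxes change per image), which is exactly the kind of control flow that tracing tends to bake in.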

Also, if you had to deploy something like this, how would you do it? (Just to constrain the problem, let's assume we have to do all of this in one container.)

To be more specific, here's the code I am working with:

        dt_boxes, elapse = self.text_detector(img)            # model 1
        img_crop_list = []                                    # compute
        for bno in range(len(dt_boxes)):                      # compute
            do something                                      # compute
            img_crop_list.append(something)                   # compute
        rec_res, elapse = self.text_recognizer(img_crop_list) # model 2
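In case it helps frame answers, here is roughly the one-container setup I was imagining: compile the two models separately, keep the crop loop in plain CPU PyTorch/NumPy, and put each compiled model on its own NeuronCore. A sketch, assuming two pre-compiled artifacts (`detector_neuron.pt`, `recognizer_neuron.pt`) and an axis-aligned crop as a placeholder for the real warp:

    import os
    # Assumption: expose two NeuronCores; how each loaded model maps onto a core depends on
    # the Neuron runtime version (older torch-neuron setups used NEURONCORE_GROUP_SIZES).
    os.environ.setdefault("NEURON_RT_VISIBLE_CORES", "0,1")

    import torch
    import torch.nn.functional as F
    import torch_neuron  # noqa: F401  needed so torch.jit.load can deserialize Neuron ops

    detector = torch.jit.load("detector_neuron.pt")      # placeholder artifact names
    recognizer = torch.jit.load("recognizer_neuron.pt")

    def crop_boxes(img, dt_boxes, crop_hw=(32, 320)):
        # CPU-side stand-in for the crop/warp loop above (axis-aligned boxes only).
        crops = [img[:, :, int(y1):int(y2), int(x1):int(x2)] for x1, y1, x2, y2 in dt_boxes]
        return torch.cat([F.interpolate(c, size=crop_hw) for c in crops], dim=0)

    def ocr(img):
        dt_boxes = detector(img)           # NeuronCore
        batch = crop_boxes(img, dt_boxes)  # CPU
        return recognizer(batch)           # NeuronCore

The idea would be to serve `ocr()` from a small thread pool so that detection on one request can overlap recognition on another and both cores stay busy, but I'm not sure this is the right approach, hence the question.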