Replicating an already trained LSTM model to serve multiple inputs in inference

Hello! Is there an efficient way to load an already trained model multiple times in parallel during inference (i.e., replicate it), so that the replicas essentially share the same weights and characteristics but are fed different, variable-length inputs?

I don't fully understand the context, but one option is to keep a single master process that owns the model. On CPU you can use torch.multiprocessing to spawn worker processes that share the same data (i.e., your model weights) and run on different cores.
On GPU, as far as I understand, you will have to load a copy of the model into each GPU's memory.
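
A minimal sketch of the CPU approach, assuming a placeholder MyModel LSTM wrapper (the class name, layer sizes, and inputs below are illustrative, not from the original post). share_memory() moves the parameters into shared memory, so the worker processes reuse one copy of the weights instead of each getting its own:

import torch
import torch.multiprocessing as mp

class MyModel(torch.nn.Module):  # hypothetical stand-in for your trained LSTM
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out[:, -1]  # hidden state at the last time step

def worker(model, seq_len):
    # Each process runs inference on its own variable-length input.
    with torch.no_grad():
        x = torch.randn(1, seq_len, 8)
        print(model(x).shape)

if __name__ == '__main__':
    model = MyModel()
    model.eval()
    model.share_memory()  # share the weights across processes instead of copying
    procs = [mp.Process(target=worker, args=(model, n)) for n in (5, 12, 30)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()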


Thank you for your response.
Is there a specific, efficient way to load the model separately onto each GPU?

Use the .to(device) method:

import torch

model = MyModel()  # your trained model class
model = model.to(torch.device('cuda:0'))  # move the weights to GPU 0
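
If you want one replica per GPU, a simple pattern (a sketch, not a definitive recipe) is to deep-copy the trained model and move each copy to its own device. MyModel and the 'model.pt' checkpoint path below are hypothetical:

import copy
import torch

model = MyModel()  # same hypothetical model class as above
model.load_state_dict(torch.load('model.pt'))  # hypothetical checkpoint path
model.eval()

# One independent copy of the weights per visible GPU.
replicas = [copy.deepcopy(model).to(torch.device(f'cuda:{i}'))
            for i in range(torch.cuda.device_count())]

# Each replica can now serve its own variable-length input on its device.
with torch.no_grad():
    for i, replica in enumerate(replicas):
        x = torch.randn(1, 10 + i, 8, device=f'cuda:{i}')
        out = replica(x)

Note that, unlike the CPU case above, the weights here are duplicated rather than shared: each GPU holds its own copy in its own memory, which matches the earlier answer.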