Asynchronous inference with weight sharing (CPU)

I was wondering if it is possible to spawn multiple threads that share the same model memory (weight sharing) to run asynchronous inference on the CPU.

Ideally, I would like to launch a new thread in a server to fulfill a request ASAP, and launch a second one to serve another request that arrives in the meantime. Is this possible in PyTorch? (i.e. is PyTorch thread-safe for inference?)
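
For concreteness, here is a minimal sketch of what I have in mind (the model, input shapes, and handler function are just placeholders, not my actual server code):

```python
import threading

import torch
import torchvision.models as models

# Placeholder model; in practice this would be my own trained network.
model = models.resnet18()
model.eval()

def handle_request(request_id):
    # Each request runs a forward pass on the *same* model instance,
    # so the weights are shared across threads rather than copied.
    x = torch.randn(1, 3, 224, 224)  # placeholder input for one request
    with torch.no_grad():
        out = model(x)
    print(f"request {request_id}: output shape {tuple(out.shape)}")

# Two requests arriving "at the same time", each served by its own thread.
threads = [threading.Thread(target=handle_request, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```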

Cheers,
Miguel
