Looking for tips on optimizing cpu inference with torchscript models


I have some recommendation models trained and torchscripted, which I’d like to use for model serving.
I’m using DJL, a Java library that loads TorchScript models through JNI. Since I’m not doing much on the Java side, I assume the performance characteristics are largely determined by the TorchScript JIT runtime.

Now my questions are:

  1. How should one typically tune inter-op and intra-op threads to get best performance in terms of throughput and tail latency?
    When I was working with TensorFlow, I used to set inter-op threads to the number of available CPUs and intra-op threads to 1, which gave enough parallelism to handle a large number of requests. When I tried the same with a TorchScript model, unfortunately requests start queuing before CPU utilization even reaches 50%. Is this expected?

  2. Are there any steps that should be taken to process the torchscript model before using it for inference?
    For example, I noticed torch.jit.freeze, but haven’t really tried it. How much of a performance gain can I typically expect from applying it?
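    For concreteness, here is roughly how I understand freezing would be applied before export (a sketch with a made-up toy module, not my actual model; torch.jit.freeze inlines weights/attributes into the graph and folds constants, and torch.jit.optimize_for_inference is a further optional pass I’ve seen mentioned alongside it):

```python
import torch

class Scorer(torch.nn.Module):
    """Toy stand-in for a recommendation scoring model."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

# Module must be scripted and in eval mode before freezing.
model = torch.jit.script(Scorer().eval())
frozen = torch.jit.freeze(model)                 # inline params, fold constants
frozen = torch.jit.optimize_for_inference(frozen)  # extra inference-oriented rewrites

out = frozen(torch.randn(2, 16))
print(out.shape)  # expect a (2, 1) score tensor
```
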

  3. Are there any other knobs to tune for cpu inference performance?
    Apart from what’s mentioned above, what other knobs should one typically try to adjust for optimal performance?