Hi,
I have some recommendation models that are trained and TorchScripted, which I'd like to use for model serving.
I'm using DJL, a Java library that loads TorchScript models through JNI. Since I'm not doing much on the Java side, I suppose the performance characteristics are largely dependent on the TorchScript JIT runtime.
Now my questions are:
- How should one typically tune inter-op and intra-op threads to get the best performance in terms of throughput and tail latency? When I was working with TensorFlow, I used to set inter-op threads to the number of available CPUs and intra-op threads to 1, which gave enough parallelism to handle a large number of requests. When I tried the same with a TorchScript model, unfortunately requests start queuing before CPU utilization even reaches 50%. Is this expected? (A sketch of the equivalent settings, as I understand them, is in the first code block after this list.)
- Are there any steps that should be taken to process the TorchScript model before using it for inference? For example, I noticed torch.jit.freeze but haven't really tried it. How much performance gain can I typically expect from applying it? (The second code block below sketches what I have in mind.)
- Are there any other knobs to tune for CPU inference performance, apart from what's mentioned above?
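
For concreteness, here's roughly the threading configuration I tried, expressed with the plain PyTorch Python API rather than through DJL; I'm assuming DJL's engine options map onto these same settings under the hood, and the model path and input shape are just placeholders:

```python
import os
import torch

# Mirror the TensorFlow-style setup: one inter-op thread per CPU,
# a single intra-op thread per op. set_num_interop_threads must be
# called before any inter-op parallel work has started.
torch.set_num_interop_threads(os.cpu_count())
torch.set_num_threads(1)

# Placeholder model path and input shape.
model = torch.jit.load("recsys_model.pt")
model.eval()

with torch.inference_mode():
    scores = model(torch.randn(1, 128))
```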
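As for freeze, this is the kind of export-time step I have in mind. The tiny model is just a stand-in for one of my recommendation models; I've also seen torch.jit.optimize_for_inference mentioned as a further pass on top of freezing, but I can't vouch for its effect:

```python
import torch

# Stand-in for one of my recommendation models.
class TinyRanker(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

scripted = torch.jit.script(TinyRanker().eval())  # freeze requires eval mode

# freeze inlines submodule parameters and attributes into the graph as
# constants, which enables constant folding and related optimizations.
frozen = torch.jit.freeze(scripted)

frozen.save("recsys_model.pt")  # this is the artifact DJL would then load
```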
Thanks!