How to speed up LSTM in libtorch?

I converted a 2-layer BiLSTM model to a TorchScript (JIT) model, and the program runs correctly. If I want to speed up inference on the CPU backend, what can I do about the LSTM operation? Is there any configuration, such as multi-threading, that would help?

There is a C++/CUDA LLTM (not LSTM) tutorial implementation here: