TorchScript not utilizing OpenMP threads (intra-op)

Hi, we have recently started implementing model inference with TorchScript in libtorch (C++). We follow the approach described in the documentation: we generate a model in Python using torch.jit.script(), save it to a file, load it in C++ using torch::jit::load(), and use it as a torch::jit::Module.

The problem we noticed is that when we run our models in C++ this way, we get no intra-op parallelism from OpenMP, and no inter-op parallelism either (when we use torch.jit.fork() in the forward pass of the Python model). We have verified that OpenMP itself works, and we set the thread count with torch::set_num_threads(); the corresponding getter returns the correct number. When we run our models with 8 intra-op threads we expect around 800% CPU usage, yet we only get 100%.

We also looked into this GitHub issue, and even when we run the code that wizardk posted there (for which he reports 2400% CPU usage), we only see 100%, the same as when we specify a single worker thread.

We have verified that we do get the expected CPU usage (800% when specifying 8 intra-op threads) if we rebuild the model natively as a torch::nn::Sequential with a stack of torch::nn::Linear modules, so the CPU usage problem only occurs when running TorchScript models. Do you have any suggestions or thoughts on why this might be the case?
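For reference, here is a minimal sketch of our C++ inference path. The model file name, thread counts, and tensor shapes below are placeholders rather than our actual values:

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <iostream>
#include <vector>

int main() {
    // Configure both thread pools: intra-op (OpenMP) and inter-op
    // (used by torch.jit.fork). Inter-op must be set before any
    // inter-op work is launched.
    torch::set_num_threads(8);          // intra-op
    torch::set_num_interop_threads(8);  // inter-op

    // This reports the expected value (8) in our setup.
    std::cout << "intra-op threads: " << torch::get_num_threads() << "\n";

    // "model.pt" is a placeholder for the file produced in Python via
    // torch.jit.script(model).save("model.pt").
    torch::jit::Module module = torch::jit::load("model.pt");
    module.eval();

    // A reasonably large input so the linear layers have enough work
    // for OpenMP to parallelize.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::randn({1024, 1024}));

    torch::NoGradGuard no_grad;
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << "\n";
}
```

Running this while watching top/htop is how we measure the ~100% CPU usage mentioned above.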
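And this is roughly the native control experiment where intra-op parallelism does kick in as expected (again, the layer sizes and iteration count are illustrative, not our exact model):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    torch::set_num_threads(8);

    // Native (non-TorchScript) module: with this variant we do see
    // the expected ~800% CPU usage.
    torch::nn::Sequential model(
        torch::nn::Linear(1024, 1024),
        torch::nn::Linear(1024, 1024),
        torch::nn::Linear(1024, 1024),
        torch::nn::Linear(1024, 1024));
    model->eval();

    torch::NoGradGuard no_grad;
    auto input = torch::randn({1024, 1024});
    torch::Tensor out;
    // Loop so the run is long enough to observe CPU usage in top/htop.
    for (int i = 0; i < 1000; ++i) {
        out = model->forward(input);
    }
    std::cout << out.sizes() << "\n";
}
```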