Using Dynamic Parallelism in C++

Hi! I am working on a project where we load a TorchScript model into C++. Following this tutorial, we built a model that uses dynamic parallelism: the forward pass is split into branches with torch.jit.fork. When we run inference in Python, we see a clear reduction in run time when parallelism is used, and torch.autograd.profiler confirms that the forked parts of the forward pass execute in parallel. However, when we load the same model into C++, there is no improvement in run time, and we do not see the parallel execution there. Is dynamic parallelism of jit::Modules even supported in C++?
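
For reference, here is a stripped-down sketch of our C++ side (the file name, input shape, and thread count are placeholders, not our actual values). As far as I understand, torch.jit.fork tasks are scheduled on the inter-op thread pool, so I also set its size explicitly here:

```cpp
#include <torch/script.h>   // torch::jit::load, torch::jit::IValue
#include <ATen/Parallel.h>  // at::set_num_interop_threads, at::get_num_interop_threads

#include <iostream>
#include <vector>

int main() {
  // torch.jit.fork tasks run on the inter-op thread pool. Its size must be
  // set before the pool is used for the first time, or the call throws.
  at::set_num_interop_threads(4);  // placeholder thread count

  // "model.pt" is a placeholder for the module exported from Python with
  // torch.jit.script(model).save("model.pt").
  torch::jit::Module module = torch::jit::load("model.pt");
  module.eval();

  // Placeholder input shape; our real model takes different inputs.
  std::vector<torch::jit::IValue> inputs;
  inputs.emplace_back(torch::randn({1, 3, 224, 224}));

  // Assumes the model returns a single tensor.
  at::Tensor output = module.forward(inputs).toTensor();

  std::cout << "inter-op threads: " << at::get_num_interop_threads() << std::endl;
  return 0;
}
```

I am printing at::get_num_interop_threads() at the end just to rule out a pool of size one on the C++ side.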