PyTorch faster than Libtorch for CNN inference

I have narrowed the comparison down to the code snippets below (`v16mms` is a list/vector of models):

spdlog::info("Running inference");
for (size_t i = 0; i < v16mms.size(); ++i) {
  outputs[i] = v16mms[i].forward(input).toTensor();
  output += outputs[i];
}
spdlog::info("Done");
VS

print('Running inference')
for i in range(len(v16mms)):
    outputs[i] = v16mms[i](input)
    output += outputs[i]
print('Done')

The two snippets do the same thing: they run the same set of models and accumulate the inference results into `output`.

My tests show that the PyTorch version takes ~230 ms while the LibTorch version takes ~300 ms. Any idea why LibTorch is slower?
(In case you want a minimally reproducible example, you can find the LibTorch file here and the PyTorch file here)

Firstly, if you are exporting the model via JIT and then using it in C++, you should run inference a few more times before timing, until the numbers stabilise. The initial runs are slow because the graph still needs to be built and optimised:
Load model successful !
t 277.576ms
t 213.905ms
t 187.971ms
t 86.0545ms
t 47.8444ms
t 47.9361ms
t 50.0757ms
t 60.5665ms
t 46.4637ms
t 46.4323ms
t 46.995ms
t 47.0363ms
Secondly, assignment between tensors in C++ can be slow; if you want to speed things up in C++, try using pointers or references instead of copying values around.