PyTorch faster than Libtorch for CNN inference

I narrow down the comparison to the below code snippets (v16mms is a List/Vector of models):

spdlog::info("Running inference");
for (size_t i = 0; i < v16mms.size(); ++i) {
  outputs[i] = v16mms[i].forward(input).toTensor();
  output += outputs[i];
}
VS

print('Running inference')
for i in range(len(v16mms)):
    outputs[i] = v16mms[i].forward(input)
    output += outputs[i]
print('Done')

The two snippets do the same thing: they run the same set of models and add each inference result to output.

My tests show that the PyTorch version takes ~230 ms while the LibTorch version takes ~300 ms. Any idea why LibTorch is slower?
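For reference, here is a minimal sketch of how I time the loop. The models here are hypothetical stand-ins (plain Python callables, not the real TorchScript modules), since the point is only the measurement pattern: warm up first so one-time setup cost is excluded, then time the loop with `time.perf_counter`.

```python
import time

# Hypothetical stand-in for the real models; in the actual code,
# v16mms holds TorchScript modules and forward() returns a tensor.
def make_model(k):
    def model(x):
        # Placeholder for model.forward(input)
        return [k * v for v in x]
    return model

v16mms = [make_model(k) for k in range(4)]
input = [1.0, 2.0, 3.0]

# Warm-up pass: the first forward call often includes one-time
# setup cost and should not be counted in the measurement.
for m in v16mms:
    m(input)

start = time.perf_counter()
outputs = [None] * len(v16mms)
output = [0.0] * len(input)
for i in range(len(v16mms)):
    outputs[i] = v16mms[i](input)
    # Element-wise accumulation, mirroring output += outputs[i]
    output = [a + b for a, b in zip(output, outputs[i])]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Running inference took {elapsed_ms:.1f} ms")
```

The same pattern applies on the C++ side with `std::chrono::steady_clock` around the loop, so both versions are measured over identical work.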
(In case you want a minimally reproducible example, you can find the LibTorch file here and the PyTorch file here)