Does libtorch C++ auto-JIT normal (eager) computations?

These days I built libtorch 1.12 from source under the CUDA 11.6.2 docker image with Intel MKL v2020.0.166 from the official repository, using TBB as the parallel framework instead of the default OpenMP (CPU threading and TorchScript inference — PyTorch 1.12 documentation). I also tried several times to build with pip-installed MKL 2022.1/2021.4 and oneTBB 2021.6/2021.5, but always hit some build or runtime problems. The build succeeded and runs the libtorch example (Installing C++ Distributions of PyTorch — PyTorch master documentation) fine.
Then I also ran into the speed problem: the test code from https://discuss.pytorch.org/t/my-libtorch-binary-built-from-source-is-slower-than-official-binary/44503 runs about 5-40x slower than PyTorch. In PyTorch the code below takes ~0.001-0.002s, but in the libtorch I built with MKL it takes ~0.02-0.03s on average, sometimes > 0.04s.
CPU: AMD 5900X, 32 GB memory, GPU: Nvidia 3070 Super.
libtorch:

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});

    // Time a single CPU matrix multiplication.
    auto start = std::chrono::high_resolution_clock::now();
    tensor.mm(weight);
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "C++ Operation Time(s) "
              << std::chrono::duration<double>(end - start).count()
              << "s" << std::endl;
    return 0;
}

pytorch:

import time

import torch
import torch.nn as nn
import torch.nn.functional as F

tensor = torch.randn(2708, 1433)
weight = torch.randn(1433, 16)

# Time a single CPU matrix multiplication.
t0 = time.time()
tensor.mm(weight)
t1 = time.time()
print("Python Operation Time(s) {:.4f}".format(t1 - t0))

From googling, someone suggested using at::set_num_threads() to adjust the number of parallel threads in each process (https://github.com/pytorch/pytorch/issues/20156). No effect.
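For reference, this is roughly how I called it (a minimal sketch, not my exact code; the thread count of 8 is just an arbitrary value for testing):

#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <iostream>

int main() {
    // Cap intra-op parallelism before any heavy ops run.
    at::set_num_threads(8);  // 8 is an arbitrary test value
    std::cout << "intra-op threads: " << at::get_num_threads() << std::endl;

    torch::Tensor a = torch::randn({2708, 1433});
    torch::Tensor b = torch::randn({1433, 16});
    a.mm(b);  // still shows the same first-call slowdown
    return 0;
}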
Someone else suggested switching from a Debug build to a Release build and adding the -O3 optimization flag. No effect either.
Finally I saw someone mention that libtorch needs a warmup step (https://discuss.pytorch.org/t/my-libtorch-binary-built-from-source-is-slower-than-official-binary/44503/11). So I wrapped the mm call in the code above in a for (int i = 0; i < 100; i++) loop and timed each iteration. The result: the first iteration costs ~0.02-0.03s, then the following 99 iterations each cost ~0.001-0.002s, which matches the PyTorch speed.
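The modified benchmark looked roughly like this (a sketch of what I ran, timing each iteration separately):

#include <torch/torch.h>
#include <iostream>
#include <chrono>

int main() {
    torch::Tensor tensor = torch::randn({2708, 1433});
    torch::Tensor weight = torch::randn({1433, 16});

    // Repeat the same mm call 100 times and time each one;
    // only the first iteration is slow, the rest match PyTorch.
    for (int i = 0; i < 100; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        tensor.mm(weight);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "iter " << i << ": "
                  << std::chrono::duration<double>(end - start).count()
                  << "s" << std::endl;
    }
    return 0;
}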
So the questions are:

    1. It appears that libtorch C++ uses some auto-JIT mechanism to create a fused graph even for normal eager computation, like ArrayFire does. Is that so? If not, how to explain the warmup phenomenon? The test code above has only normal operations, and no TorchScript or torch::jit::load call.
    2. Why is PyTorch always fast with no warmup step, and how? Are there build options to change libtorch's internal computation behavior? After all, PyTorch is also built on the same C++ low level.