C++ runs slower than the Python version

I have a question. My C++ code is below:
auto start = std::chrono::system_clock::now();
//sparse_segment_forward(input_batch, values, indices);
auto a = torch::randn({10000, 100});
auto b = torch::randn({1000,10000});
auto ret = b.mm(a);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "sparse embedding elapsed time: " << elapsed_seconds.count() << "s\n";
It shows the mm operation costs 0.5 s.
But my Python version is as follows:
import torch
import time

device = torch.device("cpu")
dtype = torch.float

x = torch.randn(10000, 100, device=device, dtype=dtype)
y = torch.randn(1000, 10000, device=device, dtype=dtype)

start = time.time()
x = y.mm(x)
end = time.time()
print(end - start)
The Python program costs 0.02 s. Both run on the same machine, so I suspect my C++ build is not using OpenBLAS on the CPU. How can I make the C++ version use OpenBLAS? Thanks.
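Separate from the BLAS question, one thing I want to rule out (an assumption on my part, not a confirmed diagnosis): a single timed call can include one-time setup such as lazy initialization or thread-pool startup, which would inflate the first measurement. A small sketch of a fairer comparison, warming up once and taking the best of several runs (`bench` is a hypothetical helper name):

```python
import time

def bench(fn, repeats=5):
    """Warm up once, then return the best wall time of several runs."""
    fn()  # warm-up run: one-time setup (lazy init, thread pools) is excluded
    times = []
    for _ in range(repeats):
        start = time.perf_counter()  # perf_counter is a monotonic high-resolution clock
        fn()
        times.append(time.perf_counter() - start)
    return min(times)  # best-of-N is more stable than a single measurement
```

The multiply in either language could be measured the same way, e.g. `bench(lambda: y.mm(x))` on the Python side.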