How can I see which routines CPU-only Libtorch uses for matrix-matrix multiplication?

I have downloaded the libtorch CPU-only version from the website and unzipped it.

In my C++ application (I am using Intel MKL), I set the thread counts:

    omp_set_num_threads(64);
    mkl_set_num_threads(64);

I then check:

    std::cout << "torch::get_num_threads() returns: " << torch::get_num_threads() << std::endl;

    std::cout << "omp_get_max_threads() returns: " << omp_get_max_threads() << std::endl;
    std::cout << "mkl_get_max_threads() returns: " << mkl_get_max_threads() << std::endl;

All three return 64. (Yes, I do have that many cores: I am on an HPC machine with 128 cores per node, and I launch 2 MPI processes per node.)

I then perform std::complex<double> matrix-matrix multiplications via torch::matmul().
These multiplications seem slow to me.

How can I check whether:

  1. Libtorch uses MKL behind the scenes?
  2. Libtorch uses multiple threads for its matrix-matrix multiplications? Do the checks above guarantee that Libtorch actually uses more than one thread?
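For point 2, one library-independent check I can think of is to compare the CPU time consumed by the process with the wall-clock time around the call. This is only a rough sketch under assumptions: `measure_cpu_vs_wall` and `CpuWallTiming` are hypothetical names of my own (not part of Libtorch or MKL), and it assumes a POSIX system where `std::clock()` sums CPU time over all threads of the process.

```cpp
#include <chrono>
#include <ctime>
#include <functional>

// Hypothetical helper type (not part of Libtorch): timing result for one call.
struct CpuWallTiming {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // CPU time used by the whole process (all threads)
    double ratio;         // cpu_seconds / wall_seconds, roughly "busy threads"
};

// Runs `work` once and compares process CPU time with wall-clock time.
// Assumption: POSIX semantics, where std::clock() accumulates CPU time
// over every thread of the process (this is not the case on Windows).
inline CpuWallTiming measure_cpu_vs_wall(const std::function<void()>& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    work();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    CpuWallTiming t{};
    t.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    // Guard against a zero-length interval for extremely short workloads.
    t.ratio = t.wall_seconds > 0.0 ? t.cpu_seconds / t.wall_seconds : 0.0;
    return t;
}
```

In my application the callable would wrap the multiplication, for example `measure_cpu_vs_wall([&] { C = torch::matmul(A, B); })`, where `A`, `B` and `C` are placeholder tensors. A ratio close to 1 would suggest a single busy thread, while a ratio well above 1 would suggest several. It is not conclusive on its own, because OpenMP worker threads that spin-wait also consume CPU time.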

Thank you!