How to see details behind CPU-only Libtorch Matrix-Matrix multiplication routines?

I have downloaded the libtorch CPU-only version from the website and unzipped it.

Inside my .cpp application, I write (I am using intel-mkl):

    omp_set_num_threads(64);
    mkl_set_num_threads(64);

I then check:

    std::cout << "torch::get_num_threads() returns: " << torch::get_num_threads() << std::endl;

    std::cout << "omp_get_max_threads() returns: " << omp_get_max_threads() << std::endl;
    std::cout << "mkl_get_max_threads() returns: " << mkl_get_max_threads() << std::endl;

These all return 64. (Yes, I do have that many cores: I am on an HPC machine with 128 cores per node, and I am launching 2 MPI processes per node.)

I then perform std::complex<double> matrix-matrix multiplications via torch::matmul().
These multiplications seem slow to me.

How can I check that:

  1. Libtorch uses MKL behind the scenes?
  2. Libtorch uses threads for its matrix-matrix multiplications? Does my check above guarantee that Libtorch uses more than 1 thread behind the scenes?

Thank you!

The linked libraries can be listed with the Dependency Walker tool (or similar) on Windows and with ldd on Linux. Alternatively, kill the running process so that it dumps core; the call stack in the core dump should reveal MKL calls.
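
As a programmatic counterpart to ldd, here is a minimal sketch, assuming Linux (it reads /proc/self/maps), that lists the files actually mapped into the running process. The helper name loaded_libraries_matching is made up for illustration and is not part of libtorch or MKL.

```cpp
#include <fstream>
#include <set>
#include <string>

// Hypothetical helper (not a libtorch or MKL API): returns the paths of all
// files mapped into the current process whose path contains `needle`.
// Assumption: Linux, where /proc/self/maps is available.
std::set<std::string> loaded_libraries_matching(const std::string& needle) {
    std::set<std::string> result;
    std::ifstream maps("/proc/self/maps");
    std::string line;
    while (std::getline(maps, line)) {
        // Line layout: address perms offset dev inode [pathname]
        // Only the pathname field can contain '/', so the first '/' starts it.
        const auto slash = line.find('/');
        if (slash == std::string::npos) continue;  // anonymous mapping, [heap], ...
        const std::string path = line.substr(slash);
        if (path.find(needle) != std::string::npos) result.insert(path);
    }
    return result;
}
```

Calling loaded_libraries_matching("mkl") after the first torch::matmul() would list any MKL shared objects present in the process. An empty result does not prove that MKL is absent: if MKL was linked statically into the libtorch library, no separate libmkl file shows up, and the call stack is the more reliable evidence.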

The number of threads can be observed with OS tools: Sysinternals Process Explorer on Windows, or top on Linux.
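
The same number can also be read from inside the application. The sketch below assumes Linux (it parses the Threads: field of /proc/self/status); the helper name current_thread_count is hypothetical and not part of any library.

```cpp
#include <fstream>
#include <limits>
#include <string>

// Hypothetical helper: number of OS threads in the current process,
// or -1 if it cannot be determined.
// Assumption: Linux, where /proc/self/status has a "Threads:" line.
int current_thread_count() {
    std::ifstream status("/proc/self/status");
    std::string key;
    while (status >> key) {
        if (key == "Threads:") {
            int n = -1;
            status >> n;
            return n;
        }
        // Not the field we want: skip the rest of this line.
        status.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    return -1;
}
```

Printing current_thread_count() before and after the first large torch::matmul() gives a rough indication: thread pools usually keep their worker threads alive after the first parallel region, so a count that stays at 1 suggests the multiplication ran single-threaded. On the command line, top -H -p <pid> shows the same threads individually along with their CPU usage.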