Libtorch index API performs poorly under repetitive usage

Hey folks:

I have a Libtorch project with execution speed issues. After spending half a day debugging it, I found that calling the index API repeatedly is much less efficient than using an accessor.

Here is the doc on the index API:
https://pytorch.org/cppdocs/notes/tensor_indexing.html

Basically, the job is to read lots of tensors and copy their values into a data object in memory:

  for (int i = 0; i < tensor.size(0); i++)
    for (int j = 0; j < tensor.size(1); j++)
      for (int k = 0; k < tensor.size(2); k++)
      {
        // index() returns a new Tensor for every element; item<float>() then extracts the scalar
        data->setData(i, j, k, tensor.index({i, j, k}).item<float>());
      }

I benchmarked the above operation; it takes 3600ms.

Then I modified the above to use the accessor API:

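  // The accessor is created once, outside the loops; it requires a CPU float tensor with 3 dimensions
  auto tensorAccessor = tensor.accessor<float, 3>();
  for (int i = 0; i < tensor.size(0); i++)
    for (int j = 0; j < tensor.size(1); j++)
      for (int k = 0; k < tensor.size(2); k++)
      {
        // a 3-D accessor takes exactly three indices
        data->setData(i, j, k, tensorAccessor[i][j][k]);
      }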

The same operation takes only 900ms.
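For anyone who wants to reproduce the comparison, here is a minimal sketch of a timing helper built only on the standard library. The name `meanMillis` and the `repeats` parameter are made up for illustration (they are not part of Libtorch), and this is not necessarily how the numbers above were measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not part of Libtorch): runs `work` `repeats` times
// and returns the mean wall-clock time per run in milliseconds.
inline double meanMillis(const std::function<void()>& work, int repeats = 5) {
  assert(repeats > 0);
  using Clock = std::chrono::steady_clock;  // monotonic, suited to measuring intervals
  const auto start = Clock::now();
  for (int r = 0; r < repeats; ++r) {
    work();
  }
  const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
  return elapsed.count() / repeats;
}
```

Wrapping each triple loop in a lambda, e.g. `meanMillis([&] { /* loops here */ })`, and comparing the two results should make the gap visible. Averaging over a few runs smooths out one-off noise such as cold caches.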

BTW, the tensor is on the CPU.

So the accessor API is way more efficient than index (roughly a 4x speedup: 900ms vs. 3600ms).
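A likely explanation, as far as I understand it: each index call produces a new Tensor object and item<float>() then pulls the scalar out of it, while an accessor only does stride arithmetic on a raw pointer. The sketch below is a simplified stand-in for that idea, not Libtorch's actual implementation; `View3D` and `makeContiguousView` are hypothetical names, and a contiguous row-major layout is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified model of a 3-D accessor (illustration only, not Libtorch code):
// it holds a raw pointer plus sizes and strides, so reading an element is
// a few multiplications and one memory load, with no per-element allocation.
struct View3D {
  const float* data;
  std::array<std::int64_t, 3> sizes;
  std::array<std::int64_t, 3> strides;

  float at(std::int64_t i, std::int64_t j, std::int64_t k) const {
    assert(i >= 0 && i < sizes[0]);
    assert(j >= 0 && j < sizes[1]);
    assert(k >= 0 && k < sizes[2]);
    return data[i * strides[0] + j * strides[1] + k * strides[2]];
  }
};

// Builds a view over a contiguous row-major buffer of shape d0 x d1 x d2.
// The buffer must outlive the returned view.
inline View3D makeContiguousView(const std::vector<float>& buf,
                                 std::int64_t d0, std::int64_t d1, std::int64_t d2) {
  assert(static_cast<std::int64_t>(buf.size()) == d0 * d1 * d2);
  return View3D{buf.data(), {d0, d1, d2}, {d1 * d2, d2, 1}};
}
```

In this model, `view.at(i, j, k)` plays the role of `tensorAccessor[i][j][k]`: all the setup cost is paid once when the view is created, which is consistent with the accessor loop being the faster one.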

Filed a bug report: