Why is libtorch Tensor indexing much slower than C++ vector indexing?

I find that indexing an at::Tensor in libtorch is much slower than indexing a C++ vector. Is that normal?

Demo code:

#include <torch/script.h> // One-stop header.

#include <iostream>
#include <memory>
#include <string>
#include <chrono>

int main(int argc, const char* argv[]) {
  at::Tensor test_tensor(at::zeros({16000, 100, 4}, at::kFloat));
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 80000; ++i) {
    auto idx = test_tensor[15999][0][0];
  }
  auto end = std::chrono::steady_clock::now();
  double time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "tensor indexing cost " << time << " ms." << std::endl;

  std::vector<float> v1(4);
  std::vector<std::vector<float>> v2(100, v1);
  std::vector<std::vector<std::vector<float>>> test_vector(16000, v2);
  start = std::chrono::steady_clock::now();
  for (int i = 0; i < 80000; ++i) {
    auto idx = test_vector[15999][0][0];
  }
  end = std::chrono::steady_clock::now();
  time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "vector indexing cost " << time << " ms." << std::endl;
}

Output:

tensor indexing cost 153 ms.
vector indexing cost 0 ms.

Environment

ubuntu 18.04
cmake 3.15.5
g++ 7.5.0
libtorch 1.6.0+cu101
CPU Intel® Core™ i7-8750H

Use an accessor instead. An accessor is much faster, though still slower than the vector. As for why `index` is so slow, look at `Tensor::index`:

Tensor Tensor::index(ArrayRef<at::indexing::TensorIndex> indices) const {
  TORCH_CHECK(indices.size() > 0, "Passing an empty index list to Tensor::index() is not valid syntax");
  OptionalDeviceGuard device_guard(device_of(*this));
  return at::indexing::get_item(*this, indices);
}

I suspect the device guard is heavy. Further, inside the `get_item` call, `self_device` is even passed around. My guess is that once CPU/GPU device handling is involved, it is going to be expensive.