Why is libtorch Tensor indexing much slower than C++ vector indexing?

I find that indexing an at::Tensor in libtorch is much slower than indexing a C++ vector. Is that normal?

Demo code:

#include <torch/script.h> // One-stop header.

#include <iostream>
#include <memory>
#include <string>
#include <chrono>

int main(int argc, const char* argv[]) {
  at::Tensor test_tensor(at::zeros({16000, 100, 4}, at::kFloat));
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 80000; ++i) {
    auto idx = test_tensor[15999][0][0];
  }
  auto end = std::chrono::steady_clock::now();
  double time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "tensor indexing cost " << time << " ms." << std::endl;

  std::vector<float> v1(4);
  std::vector<std::vector<float>> v2(100, v1);
  std::vector<std::vector<std::vector<float>>> test_vector(16000, v2);
  start = std::chrono::steady_clock::now();
  for (int i = 0; i < 80000; ++i) {
    auto idx = test_vector[15999][0][0];
  }
  end = std::chrono::steady_clock::now();
  time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "vector indexing cost " << time << " ms." << std::endl;
}

Output:

tensor indexing cost 153 ms.
vector indexing cost 0 ms.

Environment

ubuntu 18.04
cmake 3.15.5
g++ 7.5.0
libtorch 1.6.0+cu101
CPU Intel® Core™ i7-8750H

Use an accessor instead. An accessor is much faster, though still slower than the vector. As for why `index` is so slow, look at `Tensor::index`:

Tensor Tensor::index(ArrayRef<at::indexing::TensorIndex> indices) const {
  TORCH_CHECK(indices.size() > 0, "Passing an empty index list to Tensor::index() is not valid syntax");
  OptionalDeviceGuard device_guard(device_of(*this));
  return at::indexing::get_item(*this, indices);
}

I suspect the device guard is heavy. Further, inside the `get_item` call, `self_device` is even passed around. My guess is that once CPU/GPU device handling is involved, it is going to be expensive.