I find that indexing an at::Tensor in libtorch is much slower than indexing a C++ vector. Is that normal?
Demo code:
#include <torch/script.h> // One-stop header.

#include <chrono>
#include <iostream>
#include <vector>

int main(int argc, const char* argv[]) {
  at::Tensor test_tensor(at::zeros({16000, 100, 4}, at::kFloat));

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < 80000; ++i) {
    // Each operator[] dispatches through select() and materializes a new Tensor view.
    auto idx = test_tensor[15999][0][0];
  }
  auto end = std::chrono::steady_clock::now();
  double time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "tensor indexing cost " << time << " ms." << std::endl;

  std::vector<float> v1(4);
  std::vector<std::vector<float>> v2(100, v1);
  std::vector<std::vector<std::vector<float>>> test_vector(16000, v2);

  start = std::chrono::steady_clock::now();
  for (int i = 0; i < 80000; ++i) {
    // Plain nested-vector indexing; the result is unused, so an optimizing
    // compiler may remove this loop entirely (hence the 0 ms below).
    auto idx = test_vector[15999][0][0];
  }
  end = std::chrono::steady_clock::now();
  time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
  std::cout << "vector indexing cost " << time << " ms." << std::endl;
}
Output is:
tensor indexing cost 153 ms.
vector indexing cost 0 ms.
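For comparison, here is a sketch (my own addition, not part of the original run) that reads the same element through a raw data pointer, which skips the per-call dispatch that each operator[] performs. It assumes a contiguous CPU float tensor, which at::zeros produces:

#include <torch/script.h>

#include <chrono>
#include <iostream>

int main() {
  at::Tensor test_tensor = at::zeros({16000, 100, 4}, at::kFloat);
  const float* data = test_tensor.data_ptr<float>();
  const int64_t s1 = test_tensor.size(1), s2 = test_tensor.size(2);
  auto start = std::chrono::steady_clock::now();
  float sum = 0.f; // accumulate so the compiler cannot drop the loop
  for (int i = 0; i < 80000; ++i) {
    sum += data[(15999 * s1 + 0) * s2 + 0]; // manual row-major offset
  }
  auto end = std::chrono::steady_clock::now();
  std::cout << "pointer indexing cost "
            << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
            << " ms (sum=" << sum << ")." << std::endl;
}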
Environment
Ubuntu 18.04
CMake 3.15.5
g++ 7.5.0
libtorch 1.6.0+cu101
CPU Intel® Core™ i7-8750H
Lin_Jia (Lin Jia) replied on September 23, 2020, 5:20am:
See this:
GitHub issue, opened 22 Sep 2020 at 05:04PM UTC: "🐛 Bug: I have a Libtorch project that has some execution speed issues. I have spent half a day to debug it…"
An accessor is much faster, but still slower than a vector (a minimal accessor sketch is at the end of this post). Regarding why indexing is so slow, look at this:
#include <ATen/TensorIndexing.h>

#include <c10/util/Exception.h>

namespace at {
namespace indexing {

const EllipsisIndexType Ellipsis = EllipsisIndexType();

std::ostream& operator<<(std::ostream& stream, const Slice& slice) {
  stream << slice.start() << ":" << slice.stop() << ":" << slice.step();
  return stream;
}

std::ostream& operator<<(std::ostream& stream, const TensorIndex& tensor_index) {
  if (tensor_index.is_none()) {
    stream << "None";
  } else if (tensor_index.is_ellipsis()) {
    stream << "...";
  } else if (tensor_index.is_integer()) {
// … (file truncated)
Tensor Tensor::index(ArrayRef<at::indexing::TensorIndex> indices) const {
  TORCH_CHECK(indices.size() > 0, "Passing an empty index list to Tensor::index() is not valid syntax");
  OptionalDeviceGuard device_guard(device_of(*this));
  return at::indexing::get_item(*this, indices);
}
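For context, a small illustration (my own, not from the issue) of what the chained indexing in the original post resolves to: each operator[] calls select() and returns a new Tensor view, while Tensor::index() takes the whole index list at once and goes through the get_item() path shown above:

#include <torch/script.h>

#include <iostream>

int main() {
  at::Tensor t = at::zeros({16000, 100, 4}, at::kFloat);
  // Three select() calls, each constructing a new Tensor view.
  at::Tensor a = t[15999][0][0];
  // One index() call; still sets up the OptionalDeviceGuard shown above.
  at::Tensor b = t.index({15999, 0, 0});
  // Neither form returns a raw float; item<float>() does the scalar read.
  std::cout << a.item<float>() << " " << b.item<float>() << std::endl;
}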
I suspect the device guard is heavy. Further, the get_item call is here:
#pragma once

#include <c10/util/Optional.h>
#include <ATen/core/TensorBody.h>
#include <ATen/ExpandUtils.h>
#include <ATen/Functions.h>

namespace at {
namespace indexing {

const int64_t INDEX_MAX = std::numeric_limits<int64_t>::max();
const int64_t INDEX_MIN = std::numeric_limits<int64_t>::min();

enum class TensorIndexType { None, Ellipsis, Integer, Boolean, Slice, Tensor };

constexpr c10::nullopt_t None{c10::nullopt_t::init()};

struct CAFFE2_API EllipsisIndexType final { EllipsisIndexType() {} };
CAFFE2_API extern const EllipsisIndexType Ellipsis;
// … (file truncated)
self_device is even passed around. My guess is that once CPU or GPU device handling is involved, it is going to be expensive.
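For completeness, here is a minimal sketch of the accessor approach mentioned above (my own example, not from the issue). The dtype and dimensionality checks happen once when the accessor is created; each element read afterwards is plain pointer arithmetic:

#include <torch/script.h>

int main() {
  at::Tensor t = at::zeros({16000, 100, 4}, at::kFloat);
  // One-time dispatch and validation; only valid for CPU tensors.
  auto acc = t.accessor<float, 3>();
  for (int i = 0; i < 80000; ++i) {
    // Cheap read: no Tensor objects or device guards per access.
    float v = acc[15999][0][0];
    (void)v;
  }
}

Note that accessor<T, N> is CPU-only; for CUDA tensors the analogous tool is packed_accessor32/packed_accessor64, used inside kernels.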