#include <torch/extension.h>
#include <iostream>
#include <vector>
std::vector<at::Tensor> lltm_forward(torch::Tensor dummy,
    torch::Tensor mult1, int element_per_filter, int mult1_shape_0,
    int y_index_2, int mult1_shape_1)
{
    // auto dummy1 = dummy.accessor<float, 2>();
    // auto mult11 = mult1.accessor<float, 2>();
    // mult1 = mult1.data<float>;
    int iterator = 0;
    float number = 0;
    int temp = 0;
    for (int i = 0; i < mult1_shape_0 + 1; i = i + y_index_2)
    {
        for (int j = 0; j < mult1_shape_1; j++)
        {
            number = 0;
            for (int k = iterator; k < i; k++)
            {
                // .item() materializes a scalar per element -- this is the hot spot
                number = number + mult1[k][j].item().to<double>();
                temp = k;
            }
            dummy[temp][j] = number;
        }
    }
    return {dummy};
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
    m.def("nice", &lltm_forward, "LLTM forward");
}
This code works, but it is around 400x slower than its Python variant. When I replace mult1[k][j].item().to<double>() with a plain float (say i from the loop), it becomes 66% faster, which is the speed I want from this operator. I was thinking that if a float array were passed from the Python side instead of torch::Tensor mult1, I could loop over it directly (just mult1[k][j]) and it would be faster. I tried converting mult1 to float as follows, but couldn't make that work either:
auto x = mult1.to(torch::kFloat32);
Which still gives:
: error: cannot convert ‘at::Tensor’ to ‘float’ in assignment
number = number + x[k][j]; //.item().to<double>();
Guidance on this issue from respected members would be appreciated. Is it possible to send numpy arrays to a C++ operator, loop over them there, and return a numpy array? I have tried that too but could not make it work yet.
Thank you for your reply. The tensors were actually converted to numpy for the Python looping, but I was thinking C++ loops would be much faster, so a custom operator seemed like a good choice. The Python loops run on numpy arrays. Here is a snippet demonstrating both parts:
import time

#######################
# Calling the C++-made operator here
a = torch.from_numpy(dummy).to(torch.float)
b = torch.from_numpy(mult1).to(torch.float)
start = time.time()  # tensors a and b passed to the C++ operator
nani = nice.nice(
    a, b, element_per_filter, mult1.shape[0], yindex_2, mult1.shape[1])
end = time.time()
print(end - start, "C++")
#######################

start = time.time()
# Python looping on numpy arrays
for i in range(0, mult1.shape[0] + 1, yindex_2):
    for j in range(0, mult1.shape[1], 1):
        number = 0
        for k in range(iterator, i, 1):
            number = number + mult1[k][j]
            temp = k
        dummy[temp][j] = number
end = time.time()
print(end - start, "Python Loop")
I had a similar experience with libtorch. Looking at your code, you are not using any libtorch-specific functionality that requires tensors, so you should be able to easily swap the tensor for a vector.
Just converting mult1 from a tensor into a vector would greatly boost your performance.
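The suggestion can be sketched without libtorch at all. Below is a minimal, self-contained version of the same triple loop over a `std::vector<float>` that holds the matrix row-major. The names `block_column_sums`, `rows`, and `cols` are hypothetical, chosen for illustration: `cols` plays the role of `mult1.size(1)`, and `iterator` is fixed at 0 as in the original.

```cpp
#include <cstddef>
#include <vector>

// Same loop structure as lltm_forward, but over a flat row-major
// std::vector<float>: each read is a plain float load, with no
// per-element Tensor creation.
std::vector<float> block_column_sums(const std::vector<float>& mult1,
                                     std::size_t rows, std::size_t cols,
                                     std::size_t y_index_2)
{
    std::vector<float> dummy(rows * cols, 0.0f);
    std::size_t temp = 0;
    for (std::size_t i = 0; i < rows + 1; i += y_index_2)
    {
        for (std::size_t j = 0; j < cols; ++j)
        {
            float number = 0.0f;
            for (std::size_t k = 0; k < i; ++k)   // iterator == 0 in the original
            {
                number += mult1[k * cols + j];    // row-major indexing
                temp = k;
            }
            dummy[temp * cols + j] = number;
        }
    }
    return dummy;
}
```

This gives the same per-element access pattern that the raw-pointer rewrite in the later reply achieves directly on the tensor's storage.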
If possible, can you please provide a snippet based on this code? I am not using any tensor-related functionality, which is why I want to move these loops to C++.
I tried rewriting it with raw arrays (without really knowing what sizes of tensors you are using); it seems much faster:
std::vector<at::Tensor> lltm_forward2(torch::Tensor dummy, torch::Tensor mult1,
    int element_per_filter, int mult1_shape_0, int y_index_2, int mult1_shape_1)
{
    TORCH_CHECK(dummy.is_contiguous(), "dummy must be a contiguous tensor");
    TORCH_CHECK(mult1.is_contiguous(), "mult1 must be a contiguous tensor");
    TORCH_CHECK(!dummy.is_cuda(), "dummy can't be a CUDA tensor");
    TORCH_CHECK(!mult1.is_cuda(), "mult1 can't be a CUDA tensor");

    // Raw pointers into the contiguous storage: indexing becomes plain
    // float arithmetic instead of per-element Tensor creation.
    auto dummyPtr = dummy.data_ptr<float>();
    auto mult1Ptr = mult1.data_ptr<float>();
    const size_t dummyPtr_stride = dummy.size(1);
    const size_t mult1Ptr_stride = mult1.size(1);

    int iterator = 0;
    float number = 0.0f;
    int temp = 0;
    for (int i = 0; i < mult1_shape_0 + 1; i = i + y_index_2)
    {
        for (int j = 0; j < mult1_shape_1; j++)
        {
            number = 0.0f;
            for (int k = iterator; k < i; k++)
            {
                number += mult1Ptr[k * mult1Ptr_stride + j];
                temp = k;
            }
            dummyPtr[temp * dummyPtr_stride + j] = number;
        }
    }
    return { dummy };
}
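A quick note on the indexing above: for a contiguous row-major tensor, element (k, j) lives at flat offset k * stride + j, where stride is size(1). The small helper below (hypothetical name `flatten_row_major`, plain C++ with no libtorch) just makes that layout explicit:

```cpp
#include <cstddef>
#include <vector>

// Flattens a 2-D matrix into a contiguous row-major buffer, so that
// flat[k * cols + j] == m[k][j] -- the same invariant data_ptr<float>()
// relies on for a contiguous tensor.
std::vector<float> flatten_row_major(const std::vector<std::vector<float>>& m)
{
    const std::size_t cols = m.empty() ? 0 : m[0].size();
    std::vector<float> flat;
    flat.reserve(m.size() * cols);
    for (const auto& row : m)
        for (float v : row)
            flat.push_back(v);
    return flat;
}
```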