Creating and manipulating Tensors with ATen in C++ extension

Hi all,
I am writing a C++ extension for an operation that would be far too slow in Python. Part of the algorithm involves manually constructing matrices and vectors from intermediate values, so they cannot be built ahead of time in Python.

I have been looking at the ATen Tensor and Functions APIs, but it seems that the objects and routines defined there are intended to be used with preexisting TensorImpl instances.

Does anybody know how one would construct a temporary Tensor of arbitrary (but small) shape, such as a simple matrix or vector, in a C++ extension? These Tensors are only required for intermediate computations and will not be returned by the C++ extension.



Here are some working examples:

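// Zero-filled tensor with the same dtype and device as xy.
// (Older ATen releases took the type first: at::zeros(xy.type(), {...}).)
auto ious = at::zeros({10, 3, 1, 1, 2}, xy.options());
// Uninitialized int32 tensor on the GPU.
auto gt_target = at::empty({10, 3, 1, 1, 2}, at::TensorOptions().device(at::kCUDA).dtype(at::kInt));
// Copies, or zero/constant-filled tensors matching the shape and type of the input.
auto target_obj_obj = obj.clone();
auto target_wh = warmup ? at::zeros_like(wh) : wh.clone();
auto target_xy = warmup ? at::full_like(xy, 0.5) : xy.clone();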

Brilliant! As I understand it, however, the above examples run on the host side with CUDA Tensors. What if one wished to perform a small matrix multiplication within each CUDA kernel?

You can intermix these calls with CUDA kernel launches, but you cannot call them from within the kernel itself. Things like matrix multiplication already work on CUDA tensors (no need to write kernels), but if you want to do it in a kernel, it is simple to write a grid-stride loop for it.
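
For anyone finding this later, here is a rough sketch of what such a grid-stride loop could look like. It is not taken from any PyTorch source: the names `matmul_element`, `matmul_kernel`, `matmul_host` and the `HOST_DEVICE` macro are made up for illustration, and it assumes contiguous, row-major float buffers. The per-element helper is written so that it also compiles as plain C++, which makes it easy to check on the host.

```cpp
#include <cstddef>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++
// compiler, so the helper below can be tested on the host as well.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Computes one output element C[row, col] of C = A * B.
// A is (n x k), B is (k x m); both are row-major and contiguous.
HOST_DEVICE inline float matmul_element(const float* a, const float* b,
                                        std::size_t k, std::size_t m,
                                        std::size_t row, std::size_t col) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < k; ++i) {
    acc += a[row * k + i] * b[i * m + col];
  }
  return acc;
}

#ifdef __CUDACC__
// Grid-stride loop over the flat output index: each thread handles
// elements idx, idx + stride, idx + 2 * stride, ... until n * m is reached.
__global__ void matmul_kernel(const float* a, const float* b, float* c,
                              std::size_t n, std::size_t k, std::size_t m) {
  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  for (std::size_t idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       idx < n * m; idx += stride) {
    c[idx] = matmul_element(a, b, k, m, idx / m, idx % m);
  }
}
#endif

// Host reference that walks the same flat index sequentially.
inline void matmul_host(const float* a, const float* b, float* c,
                        std::size_t n, std::size_t k, std::size_t m) {
  for (std::size_t idx = 0; idx < n * m; ++idx) {
    c[idx] = matmul_element(a, b, k, m, idx / m, idx % m);
  }
}
```

In an extension, the raw pointers would presumably come from contiguous CUDA float tensors (for example via `data_ptr<float>()` in recent ATen releases), and the kernel would be launched from the host with something like `matmul_kernel<<<blocks, threads>>>(...)`. Because of the grid-stride loop, any launch configuration covers all `n * m` outputs.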

Ah, as I suspected/feared. I have been using the documentation that you linked.

I assume that the same goes for allocation of small, temporary tensors in each thread?

The problem is that each thread must perform mm and mv operations on these temporary tensors. Splitting each part of the algorithm out into multiple kernel calls would be very difficult and inefficient as the operations required depend largely on the internal logic of each thread.

Back to the drawing board. Thank you.

I find it easier to create temporaries once and pass them to kernels (as scratch space). I also write independent kernels and use a flat index and grid-stride loops, but if you have to, you can use dynamic parallelism.
I also asked about the efficiency of temporaries, and it seems to be OK to just create them on the fly when needed. I tested it, and so far it looks like PyTorch’s custom CUDA allocator does a great job when it comes to memory allocations.
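
To make the scratch-space idea a bit more concrete, here is a minimal sketch under some assumptions of mine: every work item needs one small `kDim x kDim` matrix, the scratch buffer is allocated once on the host with room for all items, and each item uses only its own slice. The names (`process_item`, `process_kernel`, `process_host`, `kDim`) and the way the matrix is filled are placeholders, not anything from the original problem.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr std::size_t kDim = 3;                    // hypothetical small matrix size
constexpr std::size_t kScratchPerItem = kDim * kDim;

// Work for one item: build a small matrix M in this item's private scratch
// slice from an intermediate value, then compute y = M * x (a small "mv").
HOST_DEVICE inline void process_item(std::size_t item, const float* values,
                                     const float* x, float* scratch, float* y) {
  float* m = scratch + item * kScratchPerItem;     // slice owned by this item
  // Placeholder construction: values[item] on the diagonal, 1 elsewhere.
  for (std::size_t r = 0; r < kDim; ++r) {
    for (std::size_t c = 0; c < kDim; ++c) {
      m[r * kDim + c] = (r == c) ? values[item] : 1.0f;
    }
  }
  // Matrix-vector product using the temporary matrix.
  for (std::size_t r = 0; r < kDim; ++r) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < kDim; ++c) {
      acc += m[r * kDim + c] * x[item * kDim + c];
    }
    y[item * kDim + r] = acc;
  }
}

#ifdef __CUDACC__
// Flat index plus grid-stride loop over the work items.
__global__ void process_kernel(const float* values, const float* x,
                               float* scratch, float* y, std::size_t num_items) {
  const std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  for (std::size_t item = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       item < num_items; item += stride) {
    process_item(item, values, x, scratch, y);
  }
}
#endif

// Sequential host version of the same loop, handy for checking results.
inline void process_host(const float* values, const float* x,
                         float* scratch, float* y, std::size_t num_items) {
  for (std::size_t item = 0; item < num_items; ++item) {
    process_item(item, values, x, scratch, y);
  }
}
```

On the host side, the scratch buffer could be an ordinary tensor created once before the launch, e.g. something like `at::empty({num_items, 3, 3}, values.options())`, with its data pointer passed to the kernel. For matrices this small, a fixed-size local array inside `process_item` might work just as well; the scratch tensor mainly helps when the temporaries are too large or their size is only known at runtime.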