How to write a custom CPU kernel

I have a custom op for which I implemented a CUDA kernel (following the instructions in https://pytorch.org/tutorials/advanced/cpp_extension.html).

Now I want to have a CPU implementation of the same op:

  • The tutorial shows how to call methods of Tensor from C++, but my op cannot be decomposed into built-in functions (I need the equivalent of a CUDA kernel on the CPU).
  • Can I still use tensor.packed_accessor<...>() and AT_DISPATCH_FLOATING_TYPES, and then just call my own C++ function instead of launching a CUDA kernel?
  • Are there any macros or helper functions for vectorization / multi-threading, or should I just use OpenMP or similar myself?
  • Any other docs, resources, or examples in the codebase that would be helpful?
  • For a CPU tensor, I think the equivalent would be Tensor.accessor<…>,
    or you can use a plain Tensor.data_ptr<…>.
  • You can use at::parallel_for to enable OpenMP (or whichever parallel backend PyTorch was built with) multi-threading; see the thread "Using at::parallel_for in a custom operator". A short sketch putting these pieces together follows this list.
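Putting those suggestions together, here is a minimal sketch of what the CPU side could look like. Everything specific in it is made up for illustration: the name my_op_cpu, the 2-D shape assumption, and the placeholder computation `x * 2 + 1` that stands in for your real per-element work.

```cpp
#include <torch/extension.h>
#include <ATen/Parallel.h>

// Hypothetical elementwise op (out = x * 2 + 1), just to show the structure.
torch::Tensor my_op_cpu(torch::Tensor x) {
  TORCH_CHECK(x.dim() == 2, "expected a 2-D tensor");
  auto out = torch::empty_like(x);

  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), "my_op_cpu", [&] {
    // accessor<scalar_t, N> is the CPU counterpart of packed_accessor.
    auto x_a = x.accessor<scalar_t, 2>();
    auto out_a = out.accessor<scalar_t, 2>();
    const int64_t rows = x.size(0);
    const int64_t cols = x.size(1);

    // at::parallel_for splits the row range across the intra-op thread pool
    // (OpenMP by default); grain size 0 lets ATen choose the chunking.
    at::parallel_for(0, rows, 0, [&](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; ++i) {
        for (int64_t j = 0; j < cols; ++j) {
          out_a[i][j] = x_a[i][j] * scalar_t(2) + scalar_t(1);
        }
      }
    });
  });
  return out;
}
```

You would then wire this up next to your CUDA implementation the same way the tutorial does, e.g. by branching on x.is_cuda() in the function you expose to Python.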

Thanks, this was really helpful! Got it working.

Note that for largish operations, using accessors is really inefficient: you do element-by-element arithmetic. You may see much better results from vectorized (AVX, NEON, …) intrinsics, even if that means you need to take care of making the tensors contiguous yourself.
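For what it's worth, here is a rough sketch of that approach for the same placeholder op, written against raw AVX intrinsics on a contiguous float32 tensor. The function name and the computation are again made up, and it assumes an x86-64 build with AVX enabled (a NEON version would follow the same pattern):

```cpp
#include <torch/extension.h>
#include <immintrin.h>  // AVX intrinsics; assumes x86-64 with -mavx

// Hypothetical float-only variant of the placeholder op (out = x * 2 + 1) that
// works on the raw contiguous buffer, 8 floats per iteration.
torch::Tensor my_op_cpu_avx(torch::Tensor x) {
  TORCH_CHECK(x.scalar_type() == torch::kFloat, "expected float32");
  auto xc = x.contiguous();            // take care of contiguity ourselves
  auto out = torch::empty_like(xc);

  const float* src = xc.data_ptr<float>();
  float* dst = out.data_ptr<float>();
  const int64_t n = xc.numel();

  const __m256 two = _mm256_set1_ps(2.0f);
  const __m256 one = _mm256_set1_ps(1.0f);

  int64_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 v = _mm256_loadu_ps(src + i);
    v = _mm256_add_ps(_mm256_mul_ps(v, two), one);
    _mm256_storeu_ps(dst + i, v);
  }
  for (; i < n; ++i) {                 // scalar tail for leftover elements
    dst[i] = src[i] * 2.0f + 1.0f;
  }
  return out;
}
```

The scalar tail loop handles the elements left over when the length is not a multiple of 8, and you could still wrap the vectorized loop in at::parallel_for to combine vectorization with multi-threading.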

Best regards

Thomas

So just to clarify, the inefficiency is because accessors preclude the use of vectorized intrinsics?

  • Suppose I made the tensors contiguous and operated on the pointer directly, but did not use intrinsics.
  • In that case, I would have to do the index arithmetic myself.
  • Shouldn't using accessors be at least as good as doing that?

So the tensor accessors really are more or less equivalent to using pointers and pointer arithmetic (this is why the dimension is part of the template).
If you have a really good compiler, it might do the vectorization for you, but last I tried, writing intrinsics got things to run faster.
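To make that concrete, for a 2-D float tensor the two reads below end up at the same address; this is just a small sketch (the function and index names are placeholders):

```cpp
#include <torch/extension.h>

// Both reads hit the same memory location; the accessor just carries the
// sizes and strides for you, with the dimensionality (2) fixed at compile time.
float read_element(const torch::Tensor& x, int64_t i, int64_t j) {
  auto a = x.accessor<float, 2>();
  float via_accessor = a[i][j];

  const float* p = x.data_ptr<float>();
  float via_pointer = p[i * x.stride(0) + j * x.stride(1)];

  TORCH_CHECK(via_accessor == via_pointer);
  return via_accessor;
}
```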

Best regards

Thomas