How to write a custom CPU kernel

I have a custom op for which I implemented a CUDA kernel (following the instructions in https://pytorch.org/tutorials/advanced/cpp_extension.html).

Now I want to have a CPU implementation of the same op:

  • The tutorial shows how to call methods of Tensor from C++, but my op cannot be decomposed into built-in functions (I need the equivalent of a CUDA kernel on the CPU).
  • Can I still use tensor.packed_accessor<...>() and AT_DISPATCH_FLOATING_TYPES, and then just call my own C++ function instead of launching a CUDA kernel?
  • Are there any macros or helper functions for vectorization / multi-threading, or should I just use OpenMP or similar myself?
  • Any other docs, resources, or examples in the codebase that would be helpful?
  • For a CPU tensor, I think the equivalent would be Tensor.accessor<…>,
    or you can use a plain Tensor.data_ptr<…>.
  • You can use at::parallel_for to enable OpenMP (or whichever parallel backend PyTorch was built with) multi-threading; see the thread "Using at::parallel_for in a custom operator". A short sketch putting these pieces together follows this list.
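Putting those suggestions together, here is a minimal sketch of what the CPU side could look like. Everything specific in it is made up for illustration: the name my_op_cpu, the 2-D shape assumption, and the placeholder computation `x * 2 + 1` that stands in for your real per-element work.

```cpp
#include <torch/extension.h>
#include <ATen/Parallel.h>

// Hypothetical elementwise op (out = x * 2 + 1), just to show the structure.
torch::Tensor my_op_cpu(torch::Tensor x) {
  TORCH_CHECK(x.dim() == 2, "expected a 2-D tensor");
  auto out = torch::empty_like(x);

  AT_DISPATCH_FLOATING_TYPES(x.scalar_type(), "my_op_cpu", [&] {
    // accessor<scalar_t, N> is the CPU counterpart of packed_accessor.
    auto x_a = x.accessor<scalar_t, 2>();
    auto out_a = out.accessor<scalar_t, 2>();
    const int64_t rows = x.size(0);
    const int64_t cols = x.size(1);

    // at::parallel_for splits the row range across the intra-op thread pool
    // (OpenMP by default); grain size 0 lets ATen choose the chunking.
    at::parallel_for(0, rows, 0, [&](int64_t begin, int64_t end) {
      for (int64_t i = begin; i < end; ++i) {
        for (int64_t j = 0; j < cols; ++j) {
          out_a[i][j] = x_a[i][j] * scalar_t(2) + scalar_t(1);
        }
      }
    });
  });
  return out;
}
```

You would then wire this up next to your CUDA implementation the same way the tutorial does, e.g. by branching on x.is_cuda() in the function you expose to Python.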

Thanks, this was really helpful! Got it working.

Note that for largish operations, using accessors is really inefficient: you do element-by-element arithmetic. You may see much better results from vectorized (AVX, NEON, …) intrinsics, even if that means you need to take care of making the tensors contiguous yourself.
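For what it's worth, here is a rough sketch of that approach for the same placeholder op, written against raw AVX intrinsics on a contiguous float32 tensor. The function name and the computation are again made up, and it assumes an x86-64 build with AVX enabled (a NEON version would follow the same pattern):

```cpp
#include <torch/extension.h>
#include <immintrin.h>  // AVX intrinsics; assumes x86-64 with -mavx

// Hypothetical float-only variant of the placeholder op (out = x * 2 + 1) that
// works on the raw contiguous buffer, 8 floats per iteration.
torch::Tensor my_op_cpu_avx(torch::Tensor x) {
  TORCH_CHECK(x.scalar_type() == torch::kFloat, "expected float32");
  auto xc = x.contiguous();            // take care of contiguity ourselves
  auto out = torch::empty_like(xc);

  const float* src = xc.data_ptr<float>();
  float* dst = out.data_ptr<float>();
  const int64_t n = xc.numel();

  const __m256 two = _mm256_set1_ps(2.0f);
  const __m256 one = _mm256_set1_ps(1.0f);

  int64_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 v = _mm256_loadu_ps(src + i);
    v = _mm256_add_ps(_mm256_mul_ps(v, two), one);
    _mm256_storeu_ps(dst + i, v);
  }
  for (; i < n; ++i) {                 // scalar tail for leftover elements
    dst[i] = src[i] * 2.0f + 1.0f;
  }
  return out;
}
```

The scalar tail loop handles the elements left over when the length is not a multiple of 8, and you could still wrap the vectorized loop in at::parallel_for to combine vectorization with multi-threading.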

Best regards

Thomas

So just to clarify, the inefficiency is because accessors preclude the use of vectorized intrinsics?

  • Suppose I made the tensors contiguous and operated on the pointer directly, but did not use intrinsics.
  • In that case, I would have to do the index arithmetic myself.
  • Shouldn't using accessors be at least as good as doing that?

So the tensor accessors really are more or less equivalent to using pointers and pointer arithmetic (this is why the dimension is part of the template).
If you have a really good compiler, it might do the vectorization for you, but last I tried, writing intrinsics got things to run faster.
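To make that concrete, for a 2-D float tensor the two reads below end up at the same address; this is just a small sketch (the function and index names are placeholders):

```cpp
#include <torch/extension.h>

// Both reads hit the same memory location; the accessor just carries the
// sizes and strides for you, with the dimensionality (2) fixed at compile time.
float read_element(const torch::Tensor& x, int64_t i, int64_t j) {
  auto a = x.accessor<float, 2>();
  float via_accessor = a[i][j];

  const float* p = x.data_ptr<float>();
  float via_pointer = p[i * x.stride(0) + j * x.stride(1)];

  TORCH_CHECK(via_accessor == via_pointer);
  return via_accessor;
}
```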

Best regards

Thomas