Now I want to have a CPU implementation of the same op:
The tutorial shows how to call methods of Tensor from C++, but my op cannot be decomposed into built-in functions (I need the equivalent of a CUDA kernel on the CPU).
Can I still use tensor.packed_accessor<...>() and AT_DISPATCH_FLOATING_TYPES, and then just call my own C++ function instead of launching a CUDA kernel?
Are there any macros/helper functions for vectorization/multi-threading, or should I just use OpenMP or something similar myself?
Any other docs, resources, or examples in the codebase that would be helpful?
Note that for largish operations, using accessors is fairly inefficient: you end up doing element-by-element arithmetic. You may see much better results by using vectorized (AVX, NEON, …) intrinsics, even if that means you need to take care of contiguous tensors yourself.
So tensor accessors really are more or less equivalent to raw pointers and pointer arithmetic (which is why the dimensionality is part of the template).
If you have a really good compiler, it might do the vectorization for you, but the last time I tried, writing intrinsics made things run faster.