Is PackedTensorAccessor slow?

Is PackedTensorAccessor32 faster than PackedTensorAccessor64, and is either of them slower than using the raw pointer?

64 vs 32 does matter on the GPU (because 64-bit arithmetic for indexing is much slower), but I’ve not seen a visible impact of using accessors vs raw pointer arithmetic. (At least for regular fp32 things; I don’t really know about fp16, where you want to load en bloc even more.)
On CPU, using accessors most likely means the loop does not get vectorized (AVX2 and similar), which makes it slow.
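To make the index-width point concrete, here is a minimal sketch in plain C++. `MiniAccessor2D`, `make_contiguous`, `sum_with_accessor` and `sum_with_raw_pointer` are hypothetical names I made up for illustration; this is not PyTorch's actual accessor class, just the same idea (pointer plus sizes and strides, with the offset computed in a chosen index type). It only shows that the accessor does the same arithmetic as manual pointer indexing and that the 32/64 choice changes nothing but the integer type used for that arithmetic. It says nothing about timings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 2-D packed accessor (NOT PyTorch's real class):
// a data pointer plus sizes and strides stored as index_t.
template <typename T, typename index_t>
struct MiniAccessor2D {
    T* data;
    index_t sizes[2];
    index_t strides[2];

    // The offset is computed in index_t, so choosing int32_t or int64_t
    // decides how wide the indexing arithmetic is.
    T& at(index_t i, index_t j) const {
        return data[i * strides[0] + j * strides[1]];
    }
};

// Build an accessor over a contiguous row-major buffer (assumed layout).
template <typename index_t>
MiniAccessor2D<float, index_t> make_contiguous(float* data, index_t rows, index_t cols) {
    assert(rows >= 0 && cols >= 0);
    return {data, {rows, cols}, {cols, 1}};
}

// Sum all elements through the accessor.
template <typename index_t>
float sum_with_accessor(const MiniAccessor2D<float, index_t>& a) {
    float total = 0.0f;
    for (index_t i = 0; i < a.sizes[0]; ++i) {
        for (index_t j = 0; j < a.sizes[1]; ++j) {
            total += a.at(i, j);
        }
    }
    return total;
}

// Sum the same buffer with raw pointer indexing, assuming it is contiguous.
float sum_with_raw_pointer(const float* p, std::int64_t rows, std::int64_t cols) {
    float total = 0.0f;
    for (std::int64_t k = 0; k < rows * cols; ++k) {
        total += p[k];
    }
    return total;
}
```

Instantiating the accessor with `std::int32_t` corresponds to the "32" flavour and with `std::int64_t` to the "64" flavour; the 32-bit one is only valid as long as every offset fits into 32 bits.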

Best regards