Libtorch tensor.permute() - malloc(): corrupted top size

I’m currently developing an op that operates over a contiguous final dimension. Both the CPU and GPU variants use the same code to pre-shape the input and post-shape the output tensor. Currently only the CPU variant seems to have a problem when reshaping the output: it fails with malloc(): corrupted top size. I’ve tried a few combinations of view/reshape, both assigning to a new tensor and mutating in place, but I keep hitting the same error. The GPU variant is fine with the same code. All the requested shapes look okay with good old print debugging (the requested permute is just (1, 0) on a 2-dim tensor).

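void restoreOutputShape(torch::Tensor& output, c10::IntArrayRef inShape, int64_t dim)
{
    // Reshape the kernel output to the permuted input shape.
    output = output.reshape(getPermutedShape(inShape, dim));
    // If dim was not already the last dimension, permute it back to its original position.
    if (dim != (output.ndimension() - 1))
    {
        output = output.permute(getReversePermutation(output.ndimension(), dim));
    }
    // Materialize the (possibly permuted) view as a contiguous tensor.
    output = output.contiguous();
}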
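
The post does not show getPermutedShape or getReversePermutation, so the following is only a guess at what such helpers might look like. It assumes the op moves dim to the last position and keeps the other dimensions in their original order. The function names are hypothetical stand-ins, and plain std::vector<int64_t> is used instead of c10::IntArrayRef so the sketch has no libtorch dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: permutation that moves `dim` to the last position
// and keeps all other dimensions in their original relative order.
std::vector<int64_t> moveToLastPermutation(int64_t ndim, int64_t dim)
{
    assert(dim >= 0 && dim < ndim);
    std::vector<int64_t> perm;
    perm.reserve(static_cast<std::size_t>(ndim));
    for (int64_t i = 0; i < ndim; ++i)
    {
        if (i != dim)
        {
            perm.push_back(i);
        }
    }
    perm.push_back(dim);
    return perm;
}

// Hypothetical helper: shape seen by the kernel after `dim` is moved last.
std::vector<int64_t> movedToLastShape(const std::vector<int64_t>& inShape, int64_t dim)
{
    const auto perm = moveToLastPermutation(static_cast<int64_t>(inShape.size()), dim);
    std::vector<int64_t> shape;
    shape.reserve(perm.size());
    for (int64_t p : perm)
    {
        shape.push_back(inShape[static_cast<std::size_t>(p)]);
    }
    return shape;
}

// Hypothetical helper: inverse of moveToLastPermutation, i.e. the argument
// to pass to permute() to restore the original dimension order.
std::vector<int64_t> inversePermutation(int64_t ndim, int64_t dim)
{
    const auto perm = moveToLastPermutation(ndim, dim);
    std::vector<int64_t> inverse(perm.size());
    for (int64_t i = 0; i < ndim; ++i)
    {
        inverse[static_cast<std::size_t>(perm[static_cast<std::size_t>(i)])] = i;
    }
    return inverse;
}
```

Under these assumptions, a 2-dim tensor with dim = 0 gives (1, 0) for both the forward and the inverse permutation, which matches the permute request described above.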

I’m open-sourcing the code now, even though it’s still WIP, so it can be inspected.

The problem turned out to be a bad stride calculation in the CPU kernel, not the post-process reshaping. Just classic UB things: the error gets thrown somewhere other than where the actual memory bug is :upside_down_face:.
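
For anyone hitting the same symptom, here is a minimal sketch of the kind of check that can surface a bad stride at the point of the faulty access rather than at a later malloc. It is not the actual kernel code. The function names and the flat-buffer model are assumptions made for illustration, and it uses only the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Row-major (contiguous) strides, in elements, for a given shape.
std::vector<int64_t> contiguousStrides(const std::vector<int64_t>& shape)
{
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = static_cast<int64_t>(shape.size()) - 2; i >= 0; --i)
    {
        const auto k = static_cast<std::size_t>(i);
        strides[k] = strides[k + 1] * shape[k + 1];
    }
    return strides;
}

// Hypothetical debug helper: compute the flat offset for an index and
// throw if it falls outside a buffer holding `bufferNumel` elements.
int64_t checkedOffset(const std::vector<int64_t>& index,
                      const std::vector<int64_t>& strides,
                      int64_t bufferNumel)
{
    assert(index.size() == strides.size());
    int64_t offset = 0;
    for (std::size_t i = 0; i < index.size(); ++i)
    {
        offset += index[i] * strides[i];
    }
    if (offset < 0 || offset >= bufferNumel)
    {
        throw std::out_of_range("offset outside buffer: strides are likely wrong");
    }
    return offset;
}
```

With a 2x3 buffer of 6 elements, index (1, 2) and the correct strides (3, 1) give offset 5, while mistaken strides of (4, 1) give offset 6 and trigger the exception before anything is written past the end. Building the CPU path with AddressSanitizer (-fsanitize=address in GCC or Clang) is another common way to get a report at the offending access.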