Improving PyTorch kernel performance

My code (for a somewhat degenerate matmul) looks like this:

def opt_foo(out, i1, i2, j1, j2, idx):
    # Weight block for this call: (j2-j1) x (i2-i1), read flat from A.
    w = A[idx : idx + (i2 - i1) * (j2 - j1)].view(j2 - j1, i2 - i1)
    # Accumulate w.T @ B-block into rows i1..i2 of out.
    out[i1 * 512 : i2 * 512].view(i2 - i1, 512).add_(
        torch.mm(w.t(), B[j1 * 512 : j2 * 512].view(j2 - j1, 512))
    )

opt_foo(out, 0, 200, 900, 1000, 0)
opt_foo(out, 200, 400, 1500, 1600, 20000)
opt_foo(out, 600, 800, 3100, 3200, 40000)
opt_foo(out, 1000, 1200, 1000, 1100, 60000)
opt_foo(out, 1000, 1200, 1500, 1600, 80000)
opt_foo(out, 1200, 1400, 400, 500, 100000)
opt_foo(out, 1400, 1600, 6400, 6500, 120000)
opt_foo(out, 1400, 1600, 7600, 7700, 140000)
opt_foo(out, 1600, 1800, 900, 1000, 160000)
opt_foo(out, 1600, 1800, 2400, 2500, 180000)

A, B, and out are all tensors on the GPU.
The views reinterpret the flat buffers as 2-D so that no transposed copy is ever materialized.
i1 and i2 are row bounds; similarly, j1 and j2 are column bounds. The calls traverse the output in row-major order, so i1 and i2 are non-decreasing from one call to the next. Also, i2 > i1 and j2 > j1 always, since they are bounds.
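
For concreteness, here is a stand-in setup sized to match the calls above (the sizes are inferred from the call arguments, not my real data):

import torch

device = "cuda"
out = torch.zeros(1800 * 512, device=device)  # rows: largest i2 is 1800, row width 512
A = torch.randn(200_000, device=device)       # flat weights: largest idx + block size is 200000
B = torch.randn(7700 * 512, device=device)    # columns: largest j2 is 7700, row width 512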

Let’s say the current time taken by this is x ms.
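
For reference, I time the region roughly like this (simplified; the synchronization is there so the asynchronous GPU work is actually counted):

import time

torch.cuda.synchronize()
t0 = time.perf_counter()
opt_foo(out, 0, 200, 900, 1000, 0)
# ... the remaining nine calls ...
torch.cuda.synchronize()
print(f"{(time.perf_counter() - t0) * 1e3:.2f} ms")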

I have tried inlining the kernel. Performance is still ~x ms.

I have tried compiling opt_foo with torch.compile (the compilation itself happens outside the region I time). The compiled version is much slower than x ms.
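
Roughly what that looked like (a sketch; options omitted):

compiled_foo = torch.compile(opt_foo)     # compiled once, outside the timed region
compiled_foo(out, 0, 200, 900, 1000, 0)   # warm-up call triggers compilation
out.zero_()                               # add_ mutates out, so reset before the measured run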

I tried storing the results in an intermediate list instead of updating out on every call. Note that calls sharing the same i1 write to the same rows, so I still had to flush the update to out between such calls. I also tried merging the updates to out for calls where the previous call's i2 equals the new call's i1. Either way, the time taken was still ~x ms.
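
The merged variant looked roughly like this, simplified to two adjacent calls with placeholder data (every block above happens to be 200 rows by 100 columns):

import torch

out = torch.zeros(400 * 512, device="cuda")
# Two calls whose row blocks are adjacent (first call's i2 == second call's i1):
w1, b1 = torch.randn(100, 200, device="cuda"), torch.randn(100, 512, device="cuda")
w2, b2 = torch.randn(100, 200, device="cuda"), torch.randn(100, 512, device="cuda")

r1 = torch.mm(w1.t(), b1)                      # rows 0..200
r2 = torch.mm(w2.t(), b2)                      # rows 200..400
out.view(400, 512).add_(torch.cat([r1, r2]))   # one fused update instead of two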

When profiling with the PyTorch profiler, most CUDA time was spent in aten::mm and the ampere_sgemm_64x32_sliced1x4_nt kernel.
On the CPU, most time was spent in aten::mm, aten::add_, and cudaLaunchKernel.
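
I profiled roughly like this:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    opt_foo(out, 0, 200, 900, 1000, 0)
    # ... the remaining nine calls ...
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))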

Does it mean anything that aten::mm appears to spend more time on the CPU than on CUDA?

How could I further optimize this?