My code (for a somewhat degenerate matmul) looks like this:
def opt_foo(out, i1, i2, j1, j2, idx):
    # View a flat slice of A as (j2-j1, i2-i1) and transpose it lazily, then
    # accumulate the product into the matching 512-wide row block of out.
    a = A[idx:idx + (i2 - i1) * (j2 - j1)].view(j2 - j1, i2 - i1)
    b = B[j1 * 512:j2 * 512].view(j2 - j1, 512)
    out[i1 * 512:i2 * 512].view(i2 - i1, 512).add_(torch.mm(a.t(), b))
opt_foo(out, 0, 200, 900, 1000, 0)
opt_foo(out, 200, 400, 1500, 1600, 20000)
opt_foo(out, 600, 800, 3100, 3200, 40000)
opt_foo(out, 1000, 1200, 1000, 1100, 60000)
opt_foo(out, 1000, 1200, 1500, 1600, 80000)
opt_foo(out, 1200, 1400, 400, 500, 100000)
opt_foo(out, 1400, 1600, 6400, 6500, 120000)
opt_foo(out, 1400, 1600, 7600, 7700, 140000)
opt_foo(out, 1600, 1800, 900, 1000, 160000)
opt_foo(out, 1600, 1800, 2400, 2500, 180000)
A, B, and out are all tensors on the GPU. The different views are there to avoid transposing the tensor. i1 and i2 represent rows; similarly, j1 and j2 represent columns. The calls traverse the output in row-major order, so the values of i1 and i2 in each call are either the same as or greater than in the previous call. Also, i2 > i1 and j2 > j1 always hold, because they represent the bounds of a block.
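For reference, a minimal setup under which the snippet above runs (the shapes and random contents are assumptions inferred from the indexing, not my real data):

import torch

device = "cuda"
# out holds 512-wide row blocks, B holds 512-wide column blocks, and A is a flat
# buffer from which (j2-j1) x (i2-i1) blocks are taken starting at offset idx.
out = torch.zeros(1800 * 512, device=device)         # largest i2 above is 1800
B = torch.randn(7700 * 512, device=device)           # largest j2 above is 7700
A = torch.randn(180000 + 200 * 100, device=device)   # largest idx plus one block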
Let’s say the current time taken by this is x ms.
I have tried inlining the kernel. Performance is still ~x ms.
I have tried compiling opt_foo using torch.compile (outside of the part I time). It is much slower than x ms.
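Roughly what the torch.compile attempt looked like (a sketch; the exact mode and warm-up strategy are assumptions):

# Compile once outside the timed region; the first call triggers compilation.
compiled_foo = torch.compile(opt_foo)
compiled_foo(out, 0, 200, 900, 1000, 0)  # warm-up call, not timed
# ... inside the timed region, call compiled_foo instead of opt_foo ...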
I tried storing the results in an intermediate list instead of updating out for each call. Note that there is a dependency between updates to out that share the same value of i1, so I did have to update out between calls having the same i1. I also tried merging the updates to out for calls where the i2 of the previous call equals the i1 of the new call. However, the time taken was still ~x ms.
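A sketch of the intermediate-list variant, to make the idea concrete (the helper names deferred_foo and flush are mine, not the real code):

pending = []  # (i1, i2, product) tuples waiting to be applied to out

def deferred_foo(i1, i2, j1, j2, idx):
    a = A[idx:idx + (i2 - i1) * (j2 - j1)].view(j2 - j1, i2 - i1)
    b = B[j1 * 512:j2 * 512].view(j2 - j1, 512)
    pending.append((i1, i2, torch.mm(a.t(), b)))

def flush(out):
    # Applied between calls that touch the same row block, and once at the end.
    for i1, i2, prod in pending:
        out[i1 * 512:i2 * 512].view(i2 - i1, 512).add_(prod)
    pending.clear()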
When profiling it with the PyTorch profiler, most of the CUDA time was spent in aten::mm and ampere_sgemm_64x32_sliced1x4_nt.
On the CPU, most of the time was spent in aten::mm, aten::add_, and cudaLaunchKernel.
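For completeness, the profiling was done roughly like this (a sketch using torch.profiler; the exact arguments are assumptions):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    opt_foo(out, 0, 200, 900, 1000, 0)
    # ... the remaining opt_foo calls from above ...
print(prof.key_averages().table(sort_by="cuda_time_total"))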
Does it mean anything that aten::mm seems to spend more time on the CPU than on CUDA?
How could I further optimize this?