Performing matrix-style operations in place

Hi all.

I’m often doing this:

torch::Tensor c = torch::matmul(a, b);

And I’m aware that this does a lot of unnecessary allocating, when I would really like to have the result of the matmul stored in b (or a).

Is there an easy way to achieve this?

No, and this is inherent to the memory access pattern of a matrix multiplication: every input value is read multiple times to compute different output values, so the inputs cannot be overwritten while the result is being computed.

If you wanted to reduce the memory consumption, you could split the computation into smaller blocks along the rows of a (or the columns of b) and overwrite the input block by block. In the extreme case you would allocate a buffer for a single row, perform the row-wise operation with matmul_out into that buffer and then copy the row back into a, as in the sketch below.
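A minimal sketch of that row-wise idea, assuming b is square (k x k) so that the product has the same shape as a and can overwrite it; the shapes and the loop are illustrative, not a tuned implementation:

#include <torch/torch.h>

int main() {
    int64_t m = 4, k = 3;
    torch::Tensor a = torch::randn({m, k});
    torch::Tensor b = torch::randn({k, k});  // assumed square so a can hold the result

    // One reusable buffer holding a single result row.
    torch::Tensor row_buf = torch::empty({k});

    for (int64_t i = 0; i < m; ++i) {
        // Row i of the product depends only on row i of a, so it is safe
        // to overwrite a[i] once its result row has been computed.
        torch::matmul_out(row_buf, a[i], b);
        a[i].copy_(row_buf);
    }
    return 0;
}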

Best regards

Thomas


Thanks Tom.

I don’t mind the memory usage, I’m more concerned about the time spent on the allocations - does PyTorch use a decent memory manager to re-use memory?

On the GPU it has its own caching allocator, and I think on mobile, too. On the CPU it uses posix_memalign, which relies on libc to do the caching (which contemporary implementations do), so I would not worry about it.
You can also reserve the output memory once and use matmul_out if you prefer to do it manually, but very likely this will not be the largest lever you have when optimizing any given model.
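A minimal sketch of reserving the output once and reusing it with matmul_out; the shapes and the loop are only illustrative:

#include <torch/torch.h>

int main() {
    torch::Tensor a = torch::randn({128, 64});
    torch::Tensor b = torch::randn({64, 32});
    torch::Tensor c = torch::empty({128, 32});  // allocated once up front

    for (int step = 0; step < 10; ++step) {
        // Writes the product into the preallocated c instead of
        // allocating a fresh result tensor on every call.
        torch::matmul_out(c, a, b);
        // ... use c ...
    }
    return 0;
}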
There is ongoing work on a static runtime, which also involves static memory planning.

Best regards

Thomas
