Computing into a sub-tensor portion of an output tensor?

Hello!

Is there a way to update only a sub-tensor portion of an output tensor in-place, without re-computing the whole output tensor?

To be specific, I have a series of similar input matrices, each differing from the previous one only in a few small patches of values scattered around. I am performing matrix multiplication on these matrices, and to reduce computation and memory usage after the initial full computation, I would like to compute using only the changed portions of the input and update the output matrix in-place.

I can locate the changes and use slices of the input and weight matrices with the linear op (torch.nn.functional.linear) to compute only the patch of the output that needs to be updated. However, I still need to copy/scatter the resulting patch back into the original output tensor to actually update it. I know that with BLAS libraries I can achieve an in-place computation and update during the GEMM call by setting m to the number of rows of the submatrix, ldc to the number of rows of the original matrix, and pointing C at the start of the submatrix (assuming column-major matrices).
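
For reference, here is a minimal sketch of what I am doing now (the shapes, the patch bounds r0/r1, and the perturbation are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

M, K, N = 1024, 512, 256
x = torch.randn(M, K)        # input matrix
weight = torch.randn(N, K)   # F.linear computes x @ weight.T -> (M, N)
out = F.linear(x, weight)    # initial full computation

# Suppose only rows r0:r1 of x have changed since the last step.
r0, r1 = 100, 108
x[r0:r1] += 0.01             # small localized change

patch = F.linear(x[r0:r1], weight)  # recompute just the affected rows
out[r0:r1].copy_(patch)             # the extra copy/scatter step I would like to avoid
```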

Is there a way to similarly specify the storage offset, stride, and data_ptr of the output tensor for a given PyTorch op (matrix multiplication, convolution, add, etc.)? Or will these ops always allocate a new contiguous tensor for the output?
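
For illustration, something like the sketch below is what I have in mind, since a slice (or torch.as_strided) already produces a view with its own storage offset and strides. I am not sure, though, whether out= with such a view is guaranteed to write in place for every op, or whether a non-contiguous view would force an internal temporary:

```python
import torch

M, K, N = 1024, 512, 256
x = torch.randn(M, K)
weight = torch.randn(N, K)
out = torch.empty(M, N)
r0, r1 = 100, 108

# out[r0:r1] is a view sharing storage with out, carrying its own storage
# offset and strides: conceptually like pointing C at the submatrix and
# setting ldc in a GEMM call.
torch.mm(x[r0:r1], weight.t(), out=out[r0:r1])

# The same view built explicitly from a storage offset and strides:
patch_view = out.as_strided(
    size=(r1 - r0, N),
    stride=out.stride(),
    storage_offset=r0 * out.stride(0),
)
```

Note that out[r0:r1] happens to be contiguous here because it spans full rows of a row-major tensor; a column slice like out[:, c0:c1] would not be, which is where I suspect this might break down.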

Thank you in advance for your help!