Vectorized slice function

Is there a vectorized variant of the slice function or an alternative to achieve the following task?

Assume that I have a two-dimensional tensor A from which I want to access a slice, e.g.

A.index( { Slice(i-1, i+2, 1), Slice(j-1, j+2, 1) } )

This works well and gives me a 3x3 sub-tensor with entries A[i-1:i+2, j-1:j+2] (the end index of Slice is exclusive, so a 3-element range needs i+2 rather than i+1) as long as i and j are scalars. Instead of looping over a set of i's and j's, I would like to 'stack' all of the resulting sub-tensors along a third dimension so that I can apply a single operation (in my case a matmul with another three-dimensional tensor) to all of them at once.

I am asking because processing the slices one by one forces me to store each result with index_put_, which causes problems when calculating gradients (see my other post here).