I am looking for a way to transform an NxWxH LongTensor into an (NxWxH)x9 LongTensor. The transform extracts a square patch around each position and returns a 2D tensor with NxWxH rows and one column per element of the square. Where a full square cannot be extracted (the border case), the missing entries should be filled with a given value; the original tensor can of course be padded beforehand.
I already know how to do this with a for loop over each dimension, but I am looking for a smarter way that reduces the complexity; so far I can’t think of a vectorized op that does this. It needs to support autograd, as I don’t want to break the graph.

You can do a 2D convolution with a specific weight tensor of shape (9, 1, 3, 3) to achieve this… although the weight tensor will have a lot of 0 entries, and conv forward and backward are not heavily optimized for this case. I’m not aware of any better way to do this other than writing a custom CUDA kernel, though…
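
For illustration, here is a minimal sketch of the convolution idea, assuming the (9, 1, 3, 3) weight is a set of one-hot kernels so that output channel k simply copies the k-th cell of each 3x3 window. The function name `patches_via_conv` and the `filler` argument are made up for this example, and the input is assumed to be a float tensor, since `conv2d` is not generally available for integer dtypes; a LongTensor would need a cast first.

```python
import torch
import torch.nn.functional as F


def patches_via_conv(x, filler=0.0):
    """Hypothetical helper: x is a float tensor of shape (N, W, H).

    Returns a tensor of shape (N*W*H, 9) holding the 3x3 patch
    around every position, with out-of-bounds cells set to `filler`.
    """
    n, w, h = x.shape
    # Add a channel dim and pad one cell on every side with the filler value.
    padded = F.pad(x.unsqueeze(1), (1, 1, 1, 1), mode="constant", value=filler)
    # One-hot kernels: kernel k has a single 1 at window cell (k // 3, k % 3).
    weight = torch.eye(9, dtype=x.dtype, device=x.device).view(9, 1, 3, 3)
    out = F.conv2d(padded, weight)  # shape (N, 9, W, H)
    # Move the 9 patch entries last, then flatten positions into rows.
    return out.permute(0, 2, 3, 1).reshape(n * w * h, 9)


x = torch.arange(12.0).view(1, 3, 4).requires_grad_()
patches = patches_via_conv(x, filler=-1.0)
patches.sum().backward()  # gradients flow back to x
```

Column 4 of the result is the centre of each window, so it matches the flattened input.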

pad = nn.ConstantPad2d(1, filler)
input = pad(input.unsqueeze(1))
shifted = []
for i in range(9):
    shifted.append(input[... shift it here])
output = torch.cat(shifted, 1)

@SimonW your approach is really interesting, but I am not sure how to implement the shift part. Could you give a simple example, such as a right or left shift? Did you mean shifting with a convolution?
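
One possible reading of the "shift" step, sketched here as an assumption rather than as what was originally meant, is plain slicing of the padded tensor: each of the 9 offsets selects a W x H window that starts at a different corner of the padded border, so no convolution is involved. The names `patches_via_shifts` and `filler` are hypothetical. Slicing and `torch.cat` are differentiable for float inputs, and the same code also runs on a LongTensor.

```python
import torch
import torch.nn.functional as F


def patches_via_shifts(x, filler=0):
    """Hypothetical helper: x has shape (N, W, H), any dtype.

    Returns a tensor of shape (N*W*H, 9) holding the 3x3 patch
    around every position, with out-of-bounds cells set to `filler`.
    """
    n, w, h = x.shape
    # Shape (N, 1, W + 2, H + 2) after padding one cell on every side.
    padded = F.pad(x.unsqueeze(1), (1, 1, 1, 1), mode="constant", value=filler)
    shifted = []
    for i in range(9):
        di, dj = divmod(i, 3)  # row and column offset of the i-th patch cell
        # A "shift" is a W x H window starting at (di, dj) in the padded tensor.
        shifted.append(padded[:, :, di:di + w, dj:dj + h])
    out = torch.cat(shifted, 1)  # shape (N, 9, W, H)
    return out.permute(0, 2, 3, 1).reshape(n * w * h, 9)


x = torch.arange(12).view(1, 3, 4)  # LongTensor input
patches = patches_via_shifts(x, filler=-1)
```

In this sketch, offset (1, 0) gives the left neighbour of each position and (1, 2) the right neighbour, which corresponds to a right or left shift of the whole image by one column.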