Parallelism of a custom loop on GPU in Python

Hi all,
I am trying to traverse an image with a 50x50 kernel, run inference on each 50x50 sub-image, and draw a bounding box corresponding to the predicted class.
Essentially something like:

for i in range(image.height):
    for j in range(image.width):
        out = model(image[:, :, i-25 : i+25, j-25 : j+25])

where inference would run on the GPU. Is there any way I could parallelize this process in Python, or with the PyTorch JIT, something similar to parallel_for in C++?

TIA

I think your best bet is to create the patches beforehand (e.g. using unfold), push these patches into the batch dimension, and try to perform the forward pass once.
Afterwards, you could reshape the output and process each patch using your bounding box approach.
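A rough sketch of that idea, assuming image is a [1, C, H, W] tensor already on the GPU and model accepts [N, C, 50, 50] batches (the chunk size of 256 is just a placeholder you would tune to your memory budget):

import torch

with torch.no_grad():
    # extract all 50x50 patches with stride 1: [1, C, H-49, W-49, 50, 50]
    patches = image.unfold(2, 50, 1).unfold(3, 50, 1)
    # move the patch grid next to the batch dim and flatten it into the batch:
    # [(H-49)*(W-49), C, 50, 50]  (this materializes every patch, so it can be
    # memory-hungry for large images)
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    patches = patches.reshape(-1, image.size(1), 50, 50)

    # run the forward pass in chunks of patches to keep memory in check
    outputs = [model(chunk) for chunk in patches.split(256)]
    out = torch.cat(outputs)

out then holds one prediction per patch, which you can reshape back to the (H-49) x (W-49) grid to recover each patch's spatial position; if you need a prediction centered on every pixel, pad the image by 25 on each side first (e.g. with F.pad).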