Sliding window over image

Hi, I'm trying to compute the dot product between a sliding window and an image (of shape (1024, 2048)), in a way that resembles a convolution.
My goal is for each pixel's score to depend on the scores of its neighborhood. So far I have come up with two ways of doing it:

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

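kernel_size = 3
pmap = torch.rand((1024, 2048), device=device)
kernel = torch.rand((kernel_size, kernel_size), device=device)

# zero-pad kernel_size // 2 pixels on every side so the output keeps the image shape (odd kernel_size)
padding = (kernel_size // 2,) * 4
padded_pmap = torch.nn.functional.pad(pmap, padding)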

# method1: view every window with unfold, then multiply by the kernel and sum over the window dims
patches = padded_pmap.unfold(0, kernel_size, step=1).unfold(1, kernel_size, step=1)
kerenelized_pmap = (patches * kernel).sum((2, 3))

# method2: explicit loops, one window per output pixel
kerenelized_pmap2 = torch.zeros_like(pmap)
for i in range(pmap.shape[0]):
    for j in range(pmap.shape[1]):
        window = padded_pmap[i:i + kernel_size, j:j + kernel_size]
        kerenelized_pmap2[i, j] = (window * kernel).sum()

The problem is that method1 uses a lot of memory: if I want to use large kernels (50+), I will exceed my device memory (and this happens even with smaller kernels), while method2 is very slow.
My next step will be to try splitting the image and then using method1 on each piece, but I wanted to see whether someone has already faced similar issues and has a better solution.
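
For reference, here is a minimal sketch of two possible directions, not a definitive answer. The first relies on the fact that torch.nn.functional.conv2d computes a cross-correlation, which is the same multiply-and-sum as method1 but without materializing every patch. The second is the "split the image" idea: method1 applied to horizontal strips. The function names and the rows_per_tile parameter are hypothetical, and both functions assume odd kernel sizes and zero padding, as in the code above.

```python
import torch
import torch.nn.functional as F


def sliding_dot_conv2d(pmap, kernel):
    """Sliding-window dot product via conv2d (assumes odd kernel sizes)."""
    kh, kw = kernel.shape
    # conv2d expects input (batch, channels, H, W) and weight (out_ch, in_ch, kH, kW)
    out = F.conv2d(pmap[None, None], kernel[None, None], padding=(kh // 2, kw // 2))
    return out[0, 0]


def sliding_dot_tiled(pmap, kernel, rows_per_tile=64):
    """method1 on horizontal strips, so only one strip of patches exists at a time."""
    kh, kw = kernel.shape
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    padded = F.pad(pmap, (kw // 2, kw // 2, kh // 2, kh // 2))
    out = torch.empty_like(pmap)
    for r0 in range(0, pmap.shape[0], rows_per_tile):
        r1 = min(r0 + rows_per_tile, pmap.shape[0])
        # output rows r0..r1-1 need padded rows r0..r1+kh-2
        strip = padded[r0:r1 + kh - 1]
        patches = strip.unfold(0, kh, 1).unfold(1, kw, 1)
        out[r0:r1] = (patches * kernel).sum((2, 3))
    return out
```

Both can be called as sliding_dot_conv2d(pmap, kernel) or sliding_dot_tiled(pmap, kernel, rows_per_tile=64) with the tensors defined above. As a rough estimate, with a 51x51 kernel the product patches * kernel in method1 holds 1024 * 2048 * 51 * 51 float32 values, about 20 GiB, while the tiled variant with rows_per_tile=64 holds about 1/16 of that at a time. Results may differ slightly between the approaches because of floating-point summation order.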