How to Parallelize a Convolution Operation?

I have the following kernel, stored as a dict of keyword arguments for `F.conv2d`:

    {'padding': 2, 'weight': tensor([[[[0., 0., 0., 0., 0.],
                                       [0., 0., 0., 0., 0.],
                                       [1., 1., 1., 1., 1.],
                                       [0., 0., 0., 0., 0.],
                                       [0., 0., 0., 0., 0.]]]], device='cuda:0')}
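
For context, that dict can be reproduced like this (`spread_kernel` is the name it is unpacked under in the loop further down; the middle row of ones spreads each pixel horizontally along its row):

    import torch

    # 5x5 kernel whose middle row is all ones: each output pixel sums its
    # horizontal 5-neighbourhood, i.e. values are spread along the row.
    weight = torch.zeros(1, 1, 5, 5, device='cuda:0')
    weight[0, 0, 2, :] = 1.0
    spread_kernel = {'padding': 2, 'weight': weight}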

and I have an image tensor of shape [B, 320, 256, 128] (batch, channels, height, width).

I want to apply this kernel to each channel independently, i.e. treat each channel as a 2D image and run a standard 2D convolution on it.
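
One loop-free way to express this should be to fold the channels into the batch dimension, so a single `F.conv2d` call processes all B * C slices at once (a sketch, using the `dpv_permuted` and `spread_kernel` names from the code below):

    import torch.nn.functional as F

    B, C, H, W = dpv_permuted.shape
    # Treat the B * C channel slices as a batch of single-channel images,
    # convolve them all in one call, then restore the original shape.
    # reshape (not view) in case the permuted tensor is non-contiguous.
    flat = dpv_permuted.reshape(B * C, 1, H, W)
    out = F.conv2d(flat, **spread_kernel).reshape(B, C, H, W)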

Currently I do it like this, but it is very slow:

    import torch.nn.functional as F

    # Convolve each (batch, channel) slice separately: B * C tiny kernel launches.
    for b in range(dpv_permuted.shape[0]):
        for c in range(dpv_permuted.shape[1]):
            dpv_permuted[b, c, :, :] = F.conv2d(dpv_permuted[b, c].unsqueeze(0).unsqueeze(0), **spread_kernel).squeeze(0).squeeze(0)

How do I make this faster? My current attempt uses a grouped (depthwise) `Conv2d`:

    # groups=320 with in_channels=out_channels=320 gives one 5x5 filter per channel.
    test = torch.nn.Conv2d(in_channels=320, out_channels=320, groups=320,
                           kernel_size=(5, 5), padding=5 // 2, bias=False).cuda()
    test.weight.requires_grad = False
    # Broadcast the single [1, 1, 5, 5] kernel into all 320 depthwise filters.
    test.weight[:, :, :, :] = kernel
    out = test(dpv_permuted)
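
If building a `Conv2d` module just to hold a fixed weight feels heavyweight, the functional form with `groups` should give the same result (a sketch, assuming the `spread_kernel` dict from the top of the question):

    import torch.nn.functional as F

    C = dpv_permuted.shape[1]  # 320
    # Expand the single [1, 1, 5, 5] filter to a depthwise weight of shape [C, 1, 5, 5].
    w = spread_kernel['weight'].expand(C, 1, 5, 5)
    # groups=C routes exactly one input channel to each filter.
    out = F.conv2d(dpv_permuted, w, padding=spread_kernel['padding'], groups=C)

Either way, the whole [B, 320, 256, 128] tensor is handled in one call instead of B * 320 separate launches.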