Too slow conv2d and AvgPool2d for large kernel sizes

I have an image of size 2000 x 2000 pixels and need the average over every 200 x 200 window. In MATLAB I get the results quickly, but in PyTorch it takes a very long time or hangs. How can I quickly compute the average of all 200 x 200 windows?

The approach I tried in PyTorch is:

        import torch
        import torch.nn as nn

        I = torch.rand(2000, 2000)
        pool = nn.AvgPool2d(kernel_size=200, stride=1, padding=100)
        re = pool(I.unsqueeze(0))  # add a channel dimension: (1, 2000, 2000)

You could push the data and the pooling layer to the GPU for a potential speedup.
Note, however, that you might fall back to the native im2col implementation for this particular use case if, e.g., cuDNN cannot find a fast kernel for this workload.
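
As a minimal sketch of what pushing both to the GPU could look like: the helper name `window_means` is made up for illustration, and the code falls back to the CPU when no CUDA device is visible. Whether this is actually faster for a 200 x 200 kernel depends on the hardware and backend.

```python
import torch
import torch.nn as nn


def window_means(image: torch.Tensor, kernel_size: int = 200) -> torch.Tensor:
    """Average over every kernel_size x kernel_size window (stride 1).

    `window_means` is a hypothetical helper, not part of PyTorch.
    """
    # Use the GPU when one is visible; otherwise stay on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    pool = nn.AvgPool2d(kernel_size=kernel_size, stride=1,
                        padding=kernel_size // 2).to(device)
    with torch.no_grad():
        # AvgPool2d expects (C, H, W) or (N, C, H, W): add a channel dimension.
        out = pool(image.to(device).unsqueeze(0))
    # Drop the channel dimension again and return a CPU tensor.
    return out.squeeze(0).cpu()
```

With the default `count_include_pad=True`, windows that overlap the zero padding are still divided by the full window area, so values near the border are smaller than the true local mean.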

But when I use the following

        I = I.cuda(0)

I get the error “Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the ‘spawn’ start method”. Could you suggest the CUDA code for the image and the layer?

To fix the error, you could use the spawn start method for multiprocessing, as explained here.

Usually you don’t need multiprocessing and could simply use the GPU in your script.
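
If the spawn start method does turn out to be needed, a minimal sketch could look like the following. The wrapper name `use_spawn` is made up for illustration; the call has to happen before any worker processes are created, typically at the top of the `if __name__ == "__main__":` block.

```python
import torch.multiprocessing as mp


def use_spawn() -> str:
    """Switch the multiprocessing start method to 'spawn' (hypothetical helper)."""
    # force=True avoids a RuntimeError if a start method was already set.
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()
```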

Sorry for my limited understanding. I need to do this for all the images in a DataLoader. In that case, do I need to use multiprocessing?

You don’t need to use multiprocessing manually and can simply set num_workers in your DataLoader to a value greater than 0.
This will use multiprocessing under the hood, and each worker will load a complete batch in the background.

I am doing the same thing, with num_workers > 0. But I still don’t understand where I am using multiprocessing in a way that triggers that error.

In that case, the error might be raised if you create CUDA tensors in your Dataset.
The vanilla use case would be to create CPU tensors in the Dataset, process them, and push them to the GPU inside your DataLoader loop.

What could be the problem here?

def __getitem__(self, index):
        target = self.lbls[index]
        # read image and convert to PIL image
        I =[index])[-1] 
        I = TF.to_pil_image(I, mode='RGB')
        # apply transformations, including totensor()
        I = self.transform(I)
        # here I need to transfer I to GPU so that I can apply pooling 
        I = I.cuda() # I am getting error here
        re = self.pool(I)

I need the pooling as a preprocessing step for the image I.

Don’t push the I tensor to the device in your __getitem__ method.
Since each worker in your DataLoader uses its own copy of the dataset, you will run into the multiprocessing issues mentioned before.

Remove I = I.cuda() from __getitem__ and move it to the training loop:

for data in loader:
    data = data.cuda()
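
Putting this together, a rough sketch of the whole pattern might look like the following. The `RandomImages` dataset and the `pooled_batches` helper are made-up stand-ins for your own code: the dataset returns CPU tensors only, and the transfer to the GPU plus the pooling happen in the loop.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset


class RandomImages(Dataset):
    """Hypothetical stand-in dataset: __getitem__ returns CPU tensors only."""

    def __init__(self, n=4, size=64):
        self.images = torch.rand(n, 3, size, size)
        self.lbls = torch.zeros(n, dtype=torch.long)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # No .cuda() here: worker processes must not touch CUDA.
        return self.images[index], self.lbls[index]


def pooled_batches(loader, kernel_size=8):
    """Move each batch to the device in the main process, then pool it."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    pool = nn.AvgPool2d(kernel_size, stride=1, padding=kernel_size // 2)
    for images, targets in loader:
        images = images.to(device, non_blocking=True)
        with torch.no_grad():
            yield pool(images), targets.to(device)
```

It could then be used as `for pooled, target in pooled_batches(DataLoader(RandomImages(), batch_size=2, num_workers=2)): ...`, with the workers only ever handling CPU tensors.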