# Apply median filter on image stack with GPU

I’m new to pytorch.

I have roughly 1000 images of size 250*1000, where roughly 60% of the pixel values are nan.

Currently I’m processing these on a CPU with matlab and it’s slower than I wanted.

I’m trying to put the processing on GPU, and using PyTorch tensor was suggested by a friend.

One of the steps that takes long is applying a median filter to each non-NaN pixel of each slice.

Here’s my code:

It turns out the same code runs in only 2 seconds on the CPU, while the GPU takes roughly 30 seconds. And I’m repeating this 1000 times.

I suspect I’m doing something naive that’s making the GPU slow.

Based on the posted screenshot, I would assume you could use `unfold` to create the image patches and call `nanmedian()` on them. While this would increase the memory usage, it should also speed up the workload (assuming it fits into your GPU memory).
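A minimal sketch of that idea on a toy input (names like `img` are placeholders, not the original variables; a 3×3 window is assumed):

```python
import torch

# Toy stand-in for one padded slice, with some NaN holes.
img = torch.arange(25, dtype=torch.float32).reshape(5, 5)
img[img % 4 == 0] = float("nan")

kh, kw = 3, 3                                      # kernel size
patches = img.unfold(0, kh, 1).unfold(1, kw, 1)    # (3, 3, kh, kw)
patches = patches.flatten(start_dim=2)             # (3, 3, kh*kw)

# One nanmedian call along the patch dimension filters the whole slice.
med = patches.nanmedian(dim=2).values
print(med.shape)  # torch.Size([3, 3])
```

Note that `nanmedian` follows `median`'s convention and returns the lower of the two middle values when a patch has an even number of valid entries.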

PS: you can post code snippets by wrapping them into three backticks ```, which makes debugging easier.

Thank you for the reply. I’ve tried unfolding the images as shown below. The unfolded tensor has size `250*1024*441`, and then I called `nanmedian()` on each patch, which ended up being even slower, actually…

I think the double for loop is what’s killing the time.

Do you know any way other than `nanmedian()` to achieve the same thing?

Is there any way to interpolate the values along the third dimension to fill all 441 entries and eliminate the NaN values? Then I could simply take the middle index… (I found that sorting along the third dimension takes no time on the GPU, which is fantastic.)

Or is there a way to set half of the NaNs to positive infinity and the other half to negative infinity?

I’m still exploring, in the meantime if you know a quick answer that would be awesome.

```python
import time
import torch

x = S_pad
kh, kw = 21, 21  # kernel size
dh, dw = 1, 1    # stride
patches = x.unfold(0, kh, dh).unfold(1, kw, dw)
patches = torch.flatten(patches, start_dim=2)
unfold_shape = patches.size()
print(unfold_shape)
patches, indices = torch.sort(patches, dim=2)

start = time.time()
for i in range(250):
    for j in range(1024):
        S_med[i + 10][j + 10] = patches[i, j, :].nanmedian()
end = time.time()
print("Time elapsed:", end - start)
```

P.S. the ``` method looks awesome

Very close now. Given `patches` is sorted along the third dimension, the code below gives the index of the median. Now I need to find out how to extract the median from `patches` given these indices…

```python
mask = ~patches.isnan()
```
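For reference, one vectorized way to do that extraction (a sketch, relying on `torch.sort` placing NaNs at the end of each patch) is to count the valid entries and `gather` at the middle index:

```python
import torch

# Sketch: pick the median out of NaN-sorted patches without a Python loop.
# `patches` is assumed to be (H, W, P) and sorted along dim=2,
# which places the NaNs at the end of each patch.
patches = torch.tensor([[[1., 3., float("nan")],
                         [2., 4., 6.]]])
patches, _ = torch.sort(patches, dim=2)

mask = ~patches.isnan()
n_valid = mask.sum(dim=2)                   # valid entries per patch
idx = ((n_valid - 1) // 2).clamp(min=0)     # middle index of the valid prefix
med = patches.gather(2, idx.unsqueeze(2)).squeeze(2)
med[n_valid == 0] = float("nan")            # all-NaN patches stay NaN
print(med)
```

For an even number of valid entries this picks the lower of the two middle values, matching `torch.median`'s convention.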

My idea was to eliminate the nested for loop, permute the actual patches into a single dimension, and then call `nanmedian` with the `dim` argument on it.
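That idea can be sketched like this (shapes are assumptions standing in for the real `250*1024*441` tensor):

```python
import torch

# Sketch: replace the nested loops with a single reduction along the
# patch dimension. `patches` stands in for the (H, W, kh*kw) tensor.
patches = torch.randn(4, 5, 9)
patches[0, 0, :3] = float("nan")        # some missing values

med = patches.nanmedian(dim=2).values   # (H, W) in one call
print(med.shape)  # torch.Size([4, 5])
```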

Ohhh, I’ve been using it like `tensor.nanmedian()` and forgot that it can take `dim` as input. Thanks, this helps a lot.

Ideally, I want to nicely put all my slices in one tensor, but that would take 400GB of GPU memory. So I’m running things in batches with a for loop.

The first iteration is pretty quick, but the second and onwards are a couple of orders of magnitude slower than the first…

i.e.

0.2 seconds is nice, but 9 seconds for 20 of them is roughly how long it would take on the CPU…

I suppose I need to somehow clean up some GPU memory between batches.
I’ve tried deleting the unfolded tensors with `del Tensor` followed by `torch.cuda.empty_cache()`, but this doesn’t help.

Any suggestions?

Thanks.

`empty_cache()` will slow down your code further as new memory would need to be allocated via synchronizing `cudaMalloc` calls.
How are you measuring the performance? Often users are not synchronizing the code and since CUDA operations are executed asynchronously, you would need to synchronize the code via `torch.cuda.synchronize()` before starting and stopping the timers.
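The usual pattern looks like this (`work()` is a placeholder for the actual GPU workload):

```python
import time
import torch

def work(x):
    # placeholder for the actual GPU workload
    return torch.sort(x, dim=-1)[0]

x = torch.randn(256, 1024)
if torch.cuda.is_available():
    x = x.cuda()
    torch.cuda.synchronize()   # finish pending work before starting the timer
start = time.time()
y = work(x)
if torch.cuda.is_available():
    torch.cuda.synchronize()   # wait for the kernels to finish before stopping
elapsed = time.time() - start
print(f"Time elapsed: {elapsed:.4f} s")
```

Without the second `synchronize()`, the timer often stops while the kernels are still running, which is why the first iteration can look deceptively fast.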

Yeah, I wasn’t synchronizing the code. The first iteration’s time is actually not synchronized…

Here’s my code. Since most of the values in these images are NaN, I don’t need to calculate the median for those. Right now I’m calculating the median for everything and then masking out the ones I don’t need, which is wasteful, but it’s what I found that produces a correct result.

How can I calculate the median only where the pixels are not NaN, while maintaining the original shape and keeping the NaNs where they are?
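One way to keep the holes (a sketch on a toy input, not the original code) is to filter everywhere and then restore NaN wherever the original pixel was NaN, using `torch.where`:

```python
import torch

# Toy slice with NaN holes; pad with NaN so nanmedian ignores the border.
img = torch.tensor([[1., float("nan"), 3.],
                    [4., 5., float("nan")],
                    [7., 8., 9.]])

pad = 1
padded = torch.nn.functional.pad(img, (pad, pad, pad, pad), value=float("nan"))
patches = padded.unfold(0, 3, 1).unfold(1, 3, 1).flatten(start_dim=2)
med = patches.nanmedian(dim=2).values     # same (3, 3) shape as img

out = torch.where(img.isnan(), img, med)  # NaN in -> NaN out
print(out)
```

This still computes the median for every pixel, but the masking itself is a single elementwise op, so the wasted work is only in the reduction.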

```python
# test gpu version
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"
device = torch.device(dev)

for i in range(0, 40):
    torch.cuda.synchronize()
    start = time.time()

    S_ini = S_correct_tensor[:, :, i * 20:(i + 1) * 20 - 1]
    size_filter = 10
    # unfold image patches to third dimension
    kh, kw = size_filter * 2 + 1, size_filter * 2 + 1  # kernel size
    dh, dw = 1, 1  # stride
    patches = S_pad.unfold(0, kh, dh).unfold(1, kw, dw)
    patches = torch.flatten(patches, start_dim=3)
```