Why does nanmedian run 10 times slower when 75% of the input is NaN?

I tried using torch.nanmedian on two GPU tensors with exactly the same size and data type, except one is filled with random numbers and the other is roughly 75% NaN.

The one with random numbers is ~15x faster…

I’m baffled by why that’s the case.

I cannot reproduce the issue and get similar times for:

import time
import torch

x = torch.randn(1024, 1024, device='cuda')

# warmup
for _ in range(10):
    _ = torch.nanmedian(x)

nb_iters = 1000
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    _ = torch.nanmedian(x)
torch.cuda.synchronize()
t1 = time.perf_counter()
print((t1 - t0)/nb_iters)


x = torch.randn(1024, 1024, device='cuda')
x[:900, :900] = torch.tensor(float('NaN'))  # ~77% of the entries become NaN
# shuffle
x = x[torch.randperm(1024), :]
x = x[:, torch.randperm(1024)].contiguous()

print(x.isnan().float().sum()/x.nelement())

# warmup
for _ in range(10):
    _ = torch.nanmedian(x)

nb_iters = 1000
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    _ = torch.nanmedian(x)
torch.cuda.synchronize()
t1 = time.perf_counter()
print((t1 - t0)/nb_iters)

Hi ptrblck,

I’ve modified your code a bit to use data similar to mine, and it reproduces what I see on my end: roughly 15x slower with a bunch of NaNs.

My GPU only has 12 GB of RAM, so there wasn’t enough memory for the randperm shuffle, but I suppose shuffling would only make it faster, if anything:

x = torch.randn(60000, 100, 400, device='cuda')

# warmup
for _ in range(10):
    _ = torch.nanmedian(x, 2)

nb_iters = 10
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    _ = torch.nanmedian(x, 2)
torch.cuda.synchronize()
t1 = time.perf_counter()
print((t1 - t0)/nb_iters)

del x
x = torch.randn(60000, 100, 400, device='cuda')
x[:50000, :90, :300] = torch.tensor(float('NaN'))  # fill a large block with NaNs
# shuffle (not enough memory for this on a 12 GB card)
# x = x[torch.randperm(60000), :, :]
# x = x[:, torch.randperm(100), :].contiguous()
# x = x[:, :, torch.randperm(400)].contiguous()

# print(x.isnan().float().sum()/x.nelement())

# warmup
for _ in range(10):
    _ = torch.nanmedian(x, 2)

nb_iters = 10
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    _ = torch.nanmedian(x, 2)
torch.cuda.synchronize()
t1 = time.perf_counter()
print((t1 - t0)/nb_iters)
del x

I suspected it was because I was using GPU memory close to the limit, but the run without NaNs doesn’t seem to be affected.
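For reference, the standard torch.cuda counters show how close the allocation is to the 12 GB limit (the 60000x100x400 float32 tensor alone is about 9.6 GB):

print(torch.cuda.memory_allocated() / 1024**3, 'GiB currently allocated')
print(torch.cuda.max_memory_allocated() / 1024**3, 'GiB peak allocation')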

Thank you very much

Thanks for the updated code! Based on this I would assume you are seeing decreased performance due to e.g. this atomic operation, which would suffer from conflicts (i.e. many duplicated values).
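One rough way to check this hypothesis (just a sketch, not a definitive test) would be to reuse your setup with a duplicated finite constant instead of NaN: if duplicated values alone trigger the slow path, the same op should show a similar slowdown even though nothing is NaN.

import time
import torch

x = torch.randn(60000, 100, 400, device='cuda')
x[:50000, :90, :300] = 1.0  # same fill pattern, but a repeated finite value instead of NaN

# warmup
for _ in range(10):
    _ = torch.nanmedian(x, 2)

nb_iters = 10
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(nb_iters):
    _ = torch.nanmedian(x, 2)
torch.cuda.synchronize()
t1 = time.perf_counter()
print((t1 - t0)/nb_iters)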

I see. Thank you so much for the reply!

I have heard that cupy.nanmedian() uses a partition and is faster than torch.nanmedian()'s sort, but I installed Python 3.10, which CuPy doesn't support yet… so I'm trying everything I can before I go and mess with my environment, since everything else was set up for Python 3.10.
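In case it is useful, here is a rough pure-PyTorch sketch (my own workaround, not what CuPy does) that sidesteps the nanmedian kernel: replace NaNs with +inf so they sort to the end, sort along the reduction dim, and pick the lower median among the valid entries of each slice. The helper name nanmedian_via_sort is made up, it returns inf instead of NaN for all-NaN slices, and whether it is actually faster on this data would have to be benchmarked.

import torch

def nanmedian_via_sort(x, dim):
    nan_mask = torch.isnan(x)
    counts = (~nan_mask).sum(dim=dim)                       # valid entries per slice
    filled = torch.where(nan_mask, torch.full_like(x, float('inf')), x)
    sorted_vals, _ = torch.sort(filled, dim=dim)            # former NaNs land at the end
    idx = ((counts - 1).clamp(min=0) // 2).unsqueeze(dim)   # lower-median index of valid entries
    return sorted_vals.gather(dim, idx).squeeze(dim)

# e.g. compare nanmedian_via_sort(x, 2) against torch.nanmedian(x, 2).values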