How to make torch.nonzero faster

I have a big tensor X of shape [?, 18900, 85] and another tensor Y of shape [?, 18900] (one boolean mask per batch item).
X holds floats and Y holds booleans.

The boolean-mask indexing below takes about 25 ms when X and Y are both on the GPU:

X = YoloModel(image)
# ...
# non-max suppression preprocessing
Y = X[..., 4] > conf_thres
for i, x in enumerate(X):
    # x shape is [18900, 85]
    x = x[Y[i]]  # 25 ms!!
# x is now [47, 85], for example
# ...

Although the same operation takes less than 1 ms when X and Y both reside on the CPU, calling X.cpu() first costs roughly 25 ms itself, so nothing is gained. How can I fix this?
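One idea that might sidestep the per-image loop is a single batched mask, so the underlying nonzero runs once for the whole batch instead of once per image. A minimal sketch, assuming X is [batch, 18900, 85] and Y is the matching [batch, 18900] boolean mask (the final split is only needed if downstream code wants per-image tensors):

kept = X[Y]                      # one fused masked index -> [total_kept, 85]
counts = Y.sum(dim=1).tolist()   # detections per image (forces one sync)
per_image = kept.split(counts)   # tuple of [n_i, 85] tensors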

Thanks for your help!

I found an equivalent operation:

for i, x in enumerate(X):
    # build a [18900, 85] mask by repeating Y[i], then flat-select and reshape
    x = torch.masked_select(x, Y[i].repeat(x.shape[1], 1).T).reshape(-1, x.shape[1])
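For what it's worth, masked_select accepts a broadcastable mask, so the repeat/transpose can probably be dropped; a sketch of the leaner form:

# a [18900, 1] mask broadcasts against x's [18900, 85]
x = torch.masked_select(x, Y[i].unsqueeze(1)).reshape(-1, x.shape[1])

Either way the underlying kernel work should be the same, so this is about readability rather than speed.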

The operation is still slow, though. It seems the GPU is not good at this kind of task…

After reading the docs I arrived at clearer code, and it shows that the performance problem comes from the nonzero function:

indices = Y[i].nonzero().squeeze()  # nonzero is the slow part
x = torch.index_select(x, 0, indices)
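As a sanity check that these formulations are interchangeable (rows is a hypothetical stand-in for one image's [18900, 85] tensor):

rows = X[i]
assert torch.equal(rows[Y[i]], torch.index_select(rows, 0, Y[i].nonzero().squeeze(1)))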

Then I found the same issue via Google:
torch.nonzero slower than np.nonzero · Issue #14848 · pytorch/pytorch (github.com)

The GPU version of nonzero is still much slower than the CPU version.

It seems quite hard to parallelize nonzero, since each output index depends on how many nonzero elements come before it (essentially a prefix sum over the mask).

I think I’ve got the right idea: slice the long tensor into chunks and run nonzero on the chunks in parallel, as in the sketch below.
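A minimal sketch of the chunking idea (chunked_nonzero is a hypothetical helper; as written the chunks run sequentially, so actual overlap would need CUDA streams or separate workers):

import torch

def chunked_nonzero(mask, chunks=4):
    # split a 1-D boolean mask, run nonzero per chunk,
    # and shift each chunk's indices by its start offset
    pieces, offset = [], 0
    for part in mask.chunk(chunks):
        pieces.append(part.nonzero().squeeze(1) + offset)
        offset += part.shape[0]
    return torch.cat(pieces)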

I found the right timing method here: Measuring GPU tensor operation speed - #4 by apaszke
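In short, CUDA calls are asynchronous, so the op under test has to be bracketed with torch.cuda.synchronize(); otherwise the measured 25 ms may mostly be earlier kernels (e.g. the model forward pass) finishing. A minimal sketch:

import time
import torch

torch.cuda.synchronize()   # wait for all pending kernels first
start = time.time()
out = x[Y[i]]              # the op under test
torch.cuda.synchronize()   # wait for this op to actually finish
print(f"{(time.time() - start) * 1000:.2f} ms")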