I have a big tensor X of shape [?, 18900, 85] and another big tensor Y of shape [?, 18900] (one [18900] mask per image).
X holds float data and Y holds boolean data.
The boolean-mask indexing marked below takes about 25 ms when X and Y are both on the GPU:
X = YoloModel(image)
# ...
# non max suppression preprocess
Y = X[..., 4] > conf_thres
for i, x in enumerate(X):
    # x shape is [18900, 85]
    x = x[Y[i]]  # 25 ms!!
    # x shape is now e.g. [47, 85]
    # ...
Although the same operation takes less than 1 ms when X and Y both reside on the CPU, moving X there with X.cpu() itself takes roughly 25 ms. So I want to know how to fix this problem.
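In case it helps, here is a minimal sketch of how the timing can be reproduced (the shapes, the device string and conf_thres are just placeholders standing in for my real data); the torch.cuda.synchronize() calls are there because CUDA kernels launch asynchronously, so without them the cost can get attributed to the wrong line:

import time
import torch

X = torch.rand(8, 18900, 85, device='cuda')   # placeholder for the model output
conf_thres = 0.25                              # placeholder threshold
Y = X[..., 4] > conf_thres

torch.cuda.synchronize()                       # finish all pending kernels first
t0 = time.time()
x = X[0][Y[0]]                                 # the boolean-mask indexing
torch.cuda.synchronize()                       # wait for this op before reading the clock
print((time.time() - t0) * 1000, 'ms')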
Thanks for your help!
I found an equivalent operation:
for i, x in enumerate(X):
    # expand the [18900] mask to [18900, 85] so masked_select keeps whole rows
    x = torch.masked_select(x, Y[i].repeat(x.shape[1], 1).T).reshape(-1, x.shape[1])
The operation is still slow though. It seems that the GPU is just not good at this kind of task…
After reading the docs I arrived at the clearer code below, and I find that the performance problem comes from the nonzero function:
mask = Y[i].nonzero().squeeze()     # indices of rows that pass the threshold
x = torch.index_select(x, 0, mask)  # gather those rows
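Timing the two steps separately (same synchronize pattern as in the sketch above, inside the loop) is how the cost can be pinned on nonzero rather than index_select:

torch.cuda.synchronize()
t0 = time.time()
mask = Y[i].nonzero().squeeze()
torch.cuda.synchronize()
print('nonzero:     ', (time.time() - t0) * 1000, 'ms')

t0 = time.time()
x = torch.index_select(x, 0, mask)
torch.cuda.synchronize()
print('index_select:', (time.time() - t0) * 1000, 'ms')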
Then I found the same issue via Google:
torch.nonzero slower than np.nonzero · Issue #14848 · pytorch/pytorch (github.com)
The GPU version of nonzero is still much slower than the CPU version.
It seems quite hard to parallelize nonzero, since each output element's position depends on how many earlier elements were kept, i.e. the output indices require a running count (a prefix-sum / stream-compaction style problem).
I think I’ve got the right idea: slice the long tensor into chunks and call nonzero on the chunks in parallel with multiprocessing, as sketched below.
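A rough sketch of the slicing part (sequential over the chunks for now; the per-chunk indices are shifted by an offset so they still refer to the original tensor; whether dispatching the chunks from multiple processes actually speeds this up on the GPU is exactly what I still need to test):

import torch

def chunked_nonzero(mask, n_chunks=4):
    # mask: 1-D boolean tensor; returns the same indices as mask.nonzero().squeeze(1)
    parts = []
    offset = 0
    for chunk in torch.chunk(mask, n_chunks):
        idx = chunk.nonzero().squeeze(1) + offset   # chunk-local indices, shifted back
        parts.append(idx)
        offset += chunk.numel()
    return torch.cat(parts)

mask = torch.rand(18900, device='cuda') > 0.25      # placeholder for Y[i]
assert torch.equal(chunked_nonzero(mask), mask.nonzero().squeeze(1))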