The time consumption of torch.where() is a bit strange

Recently, my face detection code (a model based on RetinaFace) needs torch.where to filter prediction boxes. I found that most of the post-processing time is spent on the torch.where line, specifically:
inds = torch.where(scores > args.confidence_threshold)[0]
where scores is a confidence tensor placed on the GPU. This line is very time-consuming and hard to optimize. Moreover, executing torch.where() multiple times takes about the same time as executing it once, roughly 100 ms.
Any possible suggestions would be greatly appreciated!
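For context, the filtering step described above can be sketched roughly as follows (the tensor size and threshold value are assumptions, and args.confidence_threshold is replaced by a plain variable):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Per-anchor confidence scores; size and threshold are illustrative.
scores = torch.rand(16000, device=device)
confidence_threshold = 0.5

# Keep only the indices of boxes above the confidence threshold.
# The output shape depends on the data, so this forces a host sync on CUDA.
inds = torch.where(scores > confidence_threshold)[0]
print(inds.shape)
```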

Your profiling is most likely invalid, as you are not properly synchronizing the code.
torch.where introduces data-dependent control flow (its output shape depends on the data) and thus synchronizes the code, as can also be seen here:

import torch

torch.cuda.set_sync_debug_mode("warn")

x = torch.randn(10, device="cuda")
torch.where(x > 0.)
#  UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
#  torch.where(x > 0.)

If previously launched CUDA kernels are still executing on the device, this synchronizing call will accumulate their execution time into your measurement.
Check if this is the case in your use case.
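A minimal sketch of a valid measurement, which warms up first and synchronizes both before starting and before stopping the timer so that only the torch.where call itself is measured (the tensor size and threshold are assumptions):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
scores = torch.rand(16000, device=device)

# Warm up so one-time setup costs are excluded.
for _ in range(10):
    inds = torch.where(scores > 0.5)[0]

# Make sure all previously launched kernels have finished
# before the timer starts; otherwise their runtime leaks
# into the measurement of the next synchronizing op.
if torch.cuda.is_available():
    torch.cuda.synchronize()

start = time.perf_counter()
inds = torch.where(scores > 0.5)[0]
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for the kernel before stopping the timer
elapsed = time.perf_counter() - start
print(f"torch.where: {elapsed * 1e3:.3f} ms")
```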