Strange time consumption after forward in C++

I run the operation below right after the forward pass, and it takes more than 100 ms. That is strange, because the forward pass itself only needs 2 ms.
I thought the cause might be that freeing GPU memory takes a long time. When I let the program sleep for one second after the forward pass, the operation runs fast.

boundingBoxesOfOne = torch::masked_select(boundingBoxesOfOne, mask).detach().cpu();

Is there any way to make it run fast?
I will be very grateful for any help.

When you measured the runtime of the forward pass, did you call cudaStreamSynchronize(...) (similar in effect to torch.cuda.synchronize() in Python) before stopping the timer? This can affect the runtime measurement, because CUDA kernel launches are asynchronous by default.
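To illustrate the point about asynchronous execution, here is a minimal sketch that uses only the C++ standard library. It is an analogy, not libtorch code: launch_async is a hypothetical stand-in for a forward pass that only queues work, and the blocking get() plays the role of an operation such as .cpu() that has to wait for the queued work to finish. All names in it (elapsed_ms, launch_async, Timings, time_without_sync, time_with_sync) are made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Wall-clock milliseconds spent inside fn().
template <typename Fn>
double elapsed_ms(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical stand-in for an asynchronous GPU launch: it returns at once,
// while the real work (simulated by sleeping work_ms) runs on another thread.
inline std::future<void> launch_async(int work_ms) {
    assert(work_ms >= 0);
    return std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
}

struct Timings {
    double launch_ms;         // time attributed to the "forward pass"
    double next_blocking_ms;  // time attributed to the next blocking operation
};

// No synchronization before stopping the first timer: the launch looks cheap
// and the following blocking call absorbs the cost of the queued work.
inline Timings time_without_sync(int work_ms) {
    std::future<void> pending;
    Timings t{};
    t.launch_ms = elapsed_ms([&] { pending = launch_async(work_ms); });
    t.next_blocking_ms = elapsed_ms([&] { pending.get(); });
    return t;
}

// Synchronize (wait) inside the timed region: the cost is now attributed
// to the launch, and the following blocking call returns almost immediately.
inline Timings time_with_sync(int work_ms) {
    std::future<void> pending;
    Timings t{};
    t.launch_ms = elapsed_ms([&] {
        pending = launch_async(work_ms);
        pending.wait();
    });
    t.next_blocking_ms = elapsed_ms([&] { pending.get(); });
    return t;
}
```

If the same thing is happening in the original program, the 2 ms would only be the time to queue the forward pass, and the 100 ms seen on the masked_select(...).cpu() line would mostly be the forward pass finishing. That would also fit the observation that sleeping for one second makes the line fast. In libtorch the wait inside the timed region would be a device or stream synchronization, for example cudaStreamSynchronize(...) as mentioned above, or torch::cuda::synchronize() assuming your build provides it.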

Hello, I am experiencing the same issue; the operation takes 45 ms for me. Were you able to resolve it?