I try to run this operation after the forward pass and it takes more than 100 ms. That's strange, because the forward itself only needs 2 ms.
I thought maybe freeing GPU memory was taking a long time, so I tried letting it sleep for one second after the forward — and then it runs fast.
When you measure the runtime of forward, did you use cudaStreamSynchronize(...) (which is equivalent to torch.cuda.synchronize() in Python)? This can affect runtime measurement because CUDA is by default asynchronous.
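To illustrate the point about asynchronous execution, here is a minimal sketch of how a forward pass could be timed with an explicit synchronize before stopping the clock. The model and input shapes are made up for illustration; the key calls are `torch.cuda.synchronize()` around the region being measured:

```python
import time
import torch

def timed_forward(model, x):
    # Flush any pending kernels so earlier work isn't counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(x)
    # Without this sync, the timer stops while kernels may still be
    # queued on the GPU, so "forward" looks artificially fast and the
    # next blocking call (e.g. your 100 ms operation) absorbs the cost.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return out, elapsed_ms

# Hypothetical example model and input, just for demonstration.
model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
out, ms = timed_forward(model, x)
print(f"forward took {ms:.3f} ms")
```

If the forward still reports ~2 ms with the synchronize in place, the 100 ms is genuinely elsewhere; if it jumps, the cost was simply being deferred to the next synchronizing call.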