When I use an indices tensor on a CUDA tensor, the operation takes about 0.077 seconds to finish. But when I add "print(x)", it takes only 7*10^-5 seconds? Please help me, thank you so much.
CUDA operations are executed asynchronously, so you would need to synchronize your code via torch.cuda.synchronize() before starting and stopping the host timers. Your current profiling is therefore invalid: synchronizing operations, such as print statements on device tensors, will accumulate the runtime of all previously launched and queued kernels into that single call.
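A minimal sketch of the timing pattern described above, assuming a simple indexing workload (the tensor shapes and the `timed` helper are illustrative, not from the original post):

```python
import time
import torch

def timed(fn):
    # Drain any previously queued kernels so they don't leak
    # into this measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    # Wait until the kernel launched by fn has actually finished
    # on the GPU before stopping the host timer.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10_000, 1_000, device=device)
idx = torch.randint(0, 10_000, (5_000,), device=device)

result, elapsed = timed(lambda: x[idx])
print(f"indexing took {elapsed:.6f} s")
```

Without the second synchronize, the timer stops as soon as the kernel is queued, which is why the unsynchronized measurement can look either misleadingly fast or, when a later synchronizing call such as `print` is timed instead, misleadingly slow.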