Why is index-tensor behavior weird in a CUDA environment?

When I use an index tensor in a CUDA environment, the operation takes about 0.077 seconds to finish. But when I add a "print(x)", it takes only 7*10^-5 seconds? Please help me, thank you so much.

CUDA operations are executed asynchronously, so you would need to synchronize your code via torch.cuda.synchronize() before starting and before stopping the host timers. Your current profiling is therefore invalid: a synchronizing operation, such as printing a device tensor, blocks until all previously launched and queued kernels have finished, so their accumulated runtime is attributed to whichever host-side interval happens to contain the sync point.
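
To illustrate, here is a minimal timing sketch. The tensor `x` and index tensor `idx` are illustrative stand-ins for the original code (which wasn't posted); the sketch falls back to CPU when no GPU is available so the pattern is the same either way:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10000, 1000, device=device)          # hypothetical data tensor
idx = torch.randint(0, 10000, (10000,), device=device)  # hypothetical index tensor

# Warm-up iterations so one-time startup costs don't pollute the measurement.
for _ in range(3):
    _ = x[idx]

if device == "cuda":
    torch.cuda.synchronize()  # drain all queued kernels BEFORE starting the timer
t0 = time.perf_counter()
y = x[idx]
if device == "cuda":
    torch.cuda.synchronize()  # wait for the indexing kernel to actually finish
elapsed = time.perf_counter() - t0
print(f"indexing took {elapsed:.6f} s")
```

Without the second synchronize, the timer only measures the (nearly free) kernel launch; the real work is then paid for by the next synchronizing call, such as a print of a device tensor.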