Tensor.item() takes a lot of running time

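As previously described, item() synchronizes the code: it waits for the GPU to finish its computations, since you are explicitly transferring the tensor to the CPU and creating a Python number from it. CUDA operations are queued asynchronously, so the value only exists once all pending work that produces it has finished, and Python has to block until then. The time attributed to item() is therefore usually the time of the GPU work queued before it, not of the call itself.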
If you want to avoid the synchronizations, call item() less often, or store the detached CUDA tensor and print it later (once you are fine with a sync point).
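
A minimal sketch of the second suggestion, assuming a toy model and random inputs (run_steps, the Linear layer, and the shapes are made up for illustration and are not from your code). The loss of each step stays on the device as a detached tensor, and the values are copied to the host only once at the end:

```python
import torch

# Falls back to the CPU so the sketch also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def run_steps(num_steps=5):
    # Hypothetical stand-in for a real model and data loader.
    model = torch.nn.Linear(4, 1).to(device)
    step_losses = []
    for _ in range(num_steps):
        x = torch.randn(8, 4, device=device)
        loss = model(x).pow(2).mean()
        # detach() drops the autograd graph but keeps the value on the
        # device, so nothing is copied to the host inside the loop.
        step_losses.append(loss.detach())
    # Single sync point: stack on the device, then copy to the host once.
    return torch.stack(step_losses).tolist()


losses = run_steps()
print(losses)
```

Calling loss.item() inside the loop would give the same numbers, but would make Python wait for the GPU in every iteration instead of once after the loop.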