Data transfer slow from GPU to CPU

I tested data transfer speeds in both directions, CPU->GPU and GPU->CPU, and there was a significant difference between the two.
PCIe: Speed 8 GT/s, Width x8
Result:
CPU->GPU: 5.30 GB/s
GPU->CPU: 0.93 GB/s
What is the problem? How can I speed up data transfer from GPU to CPU?
My test code is here:

import torch
import time

tensor_size = 1024 * 32  # 32768 x 32768 float32 = 4 GiB per tensor
dtype = torch.float32

tensor = torch.randn((tensor_size, tensor_size), dtype=dtype, device='cuda')
target_tensor = torch.zeros((tensor_size, tensor_size), dtype=dtype, device='cpu')  # preallocated, but not used in the loop below
torch.cuda.synchronize()

start_time = time.time()

num_iterations = 10
for _ in range(num_iterations):
    tensor.cpu()  # each call allocates a new pageable CPU tensor and copies into it

torch.cuda.synchronize()

end_time = time.time()

total_data_transferred_gb = (num_iterations * tensor_size * tensor_size * tensor.element_size()) / (1024**3)

average_bandwidth_gb_per_s = total_data_transferred_gb / (end_time - start_time)

print(f"Average bandwidth: {average_bandwidth_gb_per_s:.2f} GB/s")

OK, I tested pinned memory on the CPU side, which accelerated the CPU<->GPU transfer to 11 GB/s. But I wonder why it is so slow when using a normal tensor.cpu() call. Is it because it needs to allocate CPU memory?
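
For reference, here is a minimal sketch of the kind of pinned-memory test I mean (not my exact code; the name pinned_tensor is just for illustration). The page-locked buffer is allocated once and reused, so every iteration is a pure device-to-host copy with no fresh host allocation:

import torch
import time

tensor_size = 1024 * 32  # 32768 x 32768 float32 = 4 GiB per tensor
dtype = torch.float32

tensor = torch.randn((tensor_size, tensor_size), dtype=dtype, device='cuda')
# Allocate the pinned (page-locked) host buffer once and reuse it.
pinned_tensor = torch.zeros((tensor_size, tensor_size), dtype=dtype,
                            device='cpu', pin_memory=True)
torch.cuda.synchronize()

num_iterations = 10
start_time = time.time()
for _ in range(num_iterations):
    # Direct device-to-host copy into the preallocated pinned buffer.
    pinned_tensor.copy_(tensor, non_blocking=True)
torch.cuda.synchronize()
end_time = time.time()

total_gb = num_iterations * tensor.nelement() * tensor.element_size() / (1024**3)
print(f"Average bandwidth: {total_gb / (end_time - start_time):.2f} GB/s")

Preallocating with pin_memory=True and using copy_ avoids both the per-call host allocation and the pageable-memory path that tensor.cpu() takes.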