CPU-to-GPU transfer becomes slower after the 1st iteration

I am working with a huge matrix, but my GPU memory is limited.
So I split the matrix into chunks on the CPU and use a for loop to transfer the chunks to the GPU one by one. However, the CPU-to-GPU transfer becomes slower after the 1st iteration.

Sample code:

chunks = torch.chunk(huge_matrix, chunks=10, dim=0) # on CPU
for i in range(len(chunks)):
    x = chunks[i].to(device) # device: cuda
    # 1st: 0.75s, 2nd: 3.49s, 3rd: 3.65s, ....

If I change the transfer line as follows,

x = chunks[i].pin_memory().to(device, non_blocking=True) # device: cuda:0

and run this code multiple times within one epoch, the transfer time is greatly reduced from the 2nd run onward. Why is that?

How are you profiling the code?
Since CUDA operations are executed asynchronously, you would have to synchronize the code before starting and stopping the timers.
Otherwise your measurements will be wrong: the runtime of previously launched kernels gets accumulated into the next blocking operation.
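
A minimal sketch of synchronized timing, in case it helps. The helper name `timed_transfer` and the small stand-in matrix are assumptions made for illustration, not code from the original post, and the script falls back to the CPU (skipping the synchronization) when no GPU is available:

```python
import time

import torch


def timed_transfer(chunk, device):
    """Copy one CPU chunk to `device` and return (tensor, elapsed seconds).

    Hypothetical helper for illustration only.
    """
    use_cuda = device.type == "cuda"
    if use_cuda:
        # Wait for previously queued GPU work so it is not counted here.
        torch.cuda.synchronize(device)
    start = time.perf_counter()
    x = chunk.to(device)
    if use_cuda:
        # Wait until the copy itself has finished before stopping the timer.
        torch.cuda.synchronize(device)
    return x, time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small stand-in for the huge matrix (assumed shape, kept tiny on purpose).
huge_matrix = torch.randn(1000, 256)
chunks = torch.chunk(huge_matrix, chunks=10, dim=0)  # on CPU

timings = []
for i, chunk in enumerate(chunks):
    x, seconds = timed_transfer(chunk, device)
    timings.append(seconds)
    print(f"chunk {i}: {seconds:.6f}s")
```

If the per-chunk times even out once the timers are synchronized like this, the original slowdown was most likely earlier GPU work being billed to the later copies rather than the copies themselves getting slower.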