Pre-load tensors to GPU

I want to hide the time needed to move tensors from the CPU to the GPU. In general this can be done with gpu_tensor = tensor.cuda(non_blocking=True).
This call returns immediately and only blocks once I actually access gpu_tensor. But let’s suppose I only do some work after the first time I access the tensor on the GPU. To still hide the CPU-to-GPU transfer time, I would like to load the tensor one iteration ahead.
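
For reference, a minimal sketch of that basic pattern (the tensor shape and the final .item() call are only illustrative, not part of the original snippet) could look like this:

    import torch

    cpu_tensor = torch.rand(20000, 800).pin_memory()   # pinned host memory
    gpu_tensor = cpu_tensor.cuda(non_blocking=True)     # returns immediately

    # ... other CPU work can run here while the copy is in flight ...

    value = gpu_tensor.sum().item()   # here the CPU waits for the GPU
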
Here is a little code example:

    import time
    import torch

    # Illustrative settings (not given in the original snippet)
    device = torch.device("cuda")
    num_batches = 10
    sleep_time = 0.1
    pre_load = True       # copy the next batch one iteration ahead
    sleep_2 = True        # simulate work after the tensor access
    write_back = True     # copy the result back to the CPU

    # Timing accumulators
    to_gpu_time = tensor_access_time = total_sleep_2_time = write_back_time = 0.0

    # Two pinned source tensors on the host ...
    tensors = [torch.rand((20000, 800), dtype=torch.float32).pin_memory(),
               torch.rand((20000, 800), dtype=torch.float32).pin_memory()]

    # ... and two pre-allocated destination buffers on the GPU (double buffering)
    gpu_tensors = [torch.rand((20000, 800), dtype=torch.float32, device="cuda"),
                   torch.rand((20000, 800), dtype=torch.float32, device="cuda")]

    if pre_load:
        # Blocking copy of the first batch before the loop starts
        gpu_tensors[0].copy_(tensors[0], non_blocking=False)
    for batch_index in range(num_batches):
        to_gpu_time -= time.time()
        if pre_load:
            # Asynchronously copy the *next* batch into the currently unused buffer
            gpu_tensors[(batch_index+1)%2].copy_(tensors[(batch_index+1)%2], non_blocking=True)
        else:
            # Copy the (same) host tensor every iteration
            gpu_tensor = tensors[0].to(device, non_blocking=True)
        to_gpu_time += time.time()

        tensor_access_time -= time.time()
        if pre_load:
            gpu_tensor = gpu_tensors[batch_index%2]
        sum_result = gpu_tensor.sum().item()  # .item() synchronizes with the GPU
        tensor_access_time += time.time()

        # do some work
        total_sleep_2_time -= time.time()
        if sleep_2:
            time.sleep(sleep_time)
        total_sleep_2_time += time.time()

        if write_back:
            write_back_time -= time.time()
            cpu_tensor = gpu_tensor.cpu()
            write_back_time += time.time()

In theory it should be possible to hide the CPU-to-GPU transfer time in the sleep time after the tensor access if we pre-load one tensor ahead. Unfortunately, it does not seem to work that way.
Do you have an idea why I am still losing time in the tensor_access part even though I pre-load the tensor, or do you know how I could solve this problem?
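
One pattern sometimes used to make the intended overlap explicit, shown here only as a sketch (the dedicated copy stream, the wait_stream call, and the concrete values are illustrative additions, not part of the snippet above), is to issue the prefetch copy on its own CUDA stream and let the default stream wait on it right before the pre-loaded buffer is consumed:

    import time
    import torch

    copy_stream = torch.cuda.Stream()   # dedicated stream for the prefetch copies

    tensors = [torch.rand((20000, 800)).pin_memory(),
               torch.rand((20000, 800)).pin_memory()]
    gpu_tensors = [torch.empty((20000, 800), device="cuda"),
                   torch.empty((20000, 800), device="cuda")]

    num_batches = 10                    # illustrative value
    gpu_tensors[0].copy_(tensors[0])    # blocking copy of the very first batch

    for batch_index in range(num_batches):
        # Make the default stream wait for the copy of the *current* batch,
        # which was enqueued on copy_stream in the previous iteration.
        torch.cuda.current_stream().wait_stream(copy_stream)

        # Prefetch the next batch on the side stream; this copy can overlap
        # with the compute and the sleep below.
        with torch.cuda.stream(copy_stream):
            gpu_tensors[(batch_index + 1) % 2].copy_(
                tensors[(batch_index + 1) % 2], non_blocking=True)

        sum_result = gpu_tensors[batch_index % 2].sum().item()
        time.sleep(0.1)                 # placeholder for the real work

With this layout the host-to-device copy is enqueued on its own stream, so it can run concurrently with the kernels and the sleep issued on the default stream.
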

Host-to-device copies should be executed asynchronously if the source tensor is pinned.
I would recommend using e.g. Nsight and inspecting the kernel calls visually.
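
For example, wrapping the interesting regions in NVTX ranges makes them easy to locate on the Nsight Systems timeline; a minimal, self-contained sketch (the range names and tensor shape are arbitrary):

    import torch

    src = torch.rand(20000, 800).pin_memory()
    dst = torch.empty(20000, 800, device="cuda")

    torch.cuda.nvtx.range_push("h2d_copy")    # labelled range around the copy
    dst.copy_(src, non_blocking=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("sum_item")
    result = dst.sum().item()                 # .item() synchronizes with the GPU
    torch.cuda.nvtx.range_pop()

Profiling the script with nsys profile (e.g. nsys profile python script.py) should then show on the timeline whether the host-to-device copy actually overlaps with the surrounding work or is serialized behind it on the same stream.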