I want to hide the time needed to copy tensors from the CPU to the GPU. In general, this can be done with `gpu_tensor = tensor.cuda(non_blocking=True)`, provided the source tensor is in pinned memory.

As far as I understand, this blocks as soon as I access `gpu_tensor`. Now suppose the work that could overlap with the copy only starts after I access the tensor on the GPU for the first time. To still hide the CPU-to-GPU transfer time, I would like to load each tensor one iteration ahead.

Here is a little code example:

```
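import time
import torch

# Benchmark settings (placeholder values, adjust as needed)
device = torch.device("cuda")
pre_load, sleep_2, write_back = True, True, False
num_batches, sleep_time = 100, 0.05
to_gpu_time = tensor_access_time = total_sleep_2_time = write_back_time = 0.0

# Pinned host tensors and preallocated GPU buffers (two of each, used alternately)
tensors = [torch.rand((20000, 800), dtype=torch.float32).pin_memory(),
           torch.rand((20000, 800), dtype=torch.float32).pin_memory()]
gpu_tensors = [torch.rand((20000, 800), dtype=torch.float32, device="cuda"),
               torch.rand((20000, 800), dtype=torch.float32, device="cuda")]

if pre_load:
    # Load the first batch synchronously before the loop starts
    gpu_tensors[0].copy_(tensors[0], non_blocking=False)

for batch_index in range(num_batches):
    to_gpu_time -= time.time()
    if pre_load:
        # Start the asynchronous copy of the next batch into the other buffer
        gpu_tensors[(batch_index+1)%2].copy_(tensors[(batch_index+1)%2], non_blocking=True)
    else:
        gpu_tensor = tensors[0].to(device, non_blocking=True)
    to_gpu_time += time.time()

    tensor_access_time -= time.time()
    if pre_load:
        gpu_tensor = gpu_tensors[batch_index%2]
    sum_result = gpu_tensor.sum().item()  # .item() makes the host wait for the result
    tensor_access_time += time.time()

    # do some work
    total_sleep_2_time -= time.time()
    if sleep_2:
        time.sleep(sleep_time)
    total_sleep_2_time += time.time()

    if write_back:
        write_back_time -= time.time()
        cpu_tensor = gpu_tensor.cpu()
        write_back_time += time.time()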
```

In theory, when pre-loading one tensor ahead, the CPU-to-GPU transfer should be hidden in the sleep time that follows the tensor access. Unfortunately, it does not seem to work that way.
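As a rough sanity check of that expectation, the size of one tensor from the example can be turned into an estimated copy time. The bandwidth figure below is purely an assumption for illustration, not something I measured; the actual value depends on the hardware.

```python
# Size of one tensor from the example: 20000 x 800 float32 values
rows, cols, bytes_per_float32 = 20000, 800, 4
tensor_bytes = rows * cols * bytes_per_float32  # 64,000,000 bytes

# ASSUMPTION: effective pinned host-to-device bandwidth, not measured
assumed_bandwidth = 12e9  # bytes per second

copy_seconds = tensor_bytes / assumed_bandwidth
print(f"{tensor_bytes / 1e6:.0f} MB per tensor, about {copy_seconds * 1e3:.1f} ms per copy")
```

Under that assumption, any `sleep_time` well above a few milliseconds should be long enough to cover one copy.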

Do you have an idea why I am still losing time in the tensor_access part when I pre-load a tensor, or do you know how I could solve this problem?
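One variant I could try, assuming (unconfirmed) that the cause is that the pre-load copy and the `sum()` are queued on the same CUDA stream, so the sum has to wait for the copy issued just before it. The sketch below issues the copies on a separate stream and uses events so that the compute stream waits only for the buffer it actually reads. `prefetched_sums` and the small stand-in tensors are hypothetical names and values for illustration, not part of my benchmark.

```python
import torch

def prefetched_sums(host_tensors, num_batches):
    """Hypothetical helper: sum batch i while batch i+1 is copied on a side stream."""
    if not torch.cuda.is_available():
        # No GPU present: nothing to overlap, so compute on the CPU instead.
        return [host_tensors[i % 2].sum().item() for i in range(num_batches)]

    copy_stream = torch.cuda.Stream()                 # side stream used only for copies
    ready = [torch.cuda.Event(), torch.cuda.Event()]  # one "copy finished" event per buffer
    buffers = [torch.empty(t.shape, dtype=t.dtype, device="cuda") for t in host_tensors]

    # Load the first batch on the side stream and mark when it is done.
    with torch.cuda.stream(copy_stream):
        buffers[0].copy_(host_tensors[0], non_blocking=True)
        ready[0].record(copy_stream)

    sums = []
    for i in range(num_batches):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < num_batches:
            # Do not overwrite buffer `nxt` before earlier work on it has finished.
            copy_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(copy_stream):
                buffers[nxt].copy_(host_tensors[nxt], non_blocking=True)
                ready[nxt].record(copy_stream)
        # The compute stream waits only for the copy of the current buffer.
        torch.cuda.current_stream().wait_event(ready[cur])
        sums.append(buffers[cur].sum().item())
    copy_stream.synchronize()  # make sure no copy is still in flight on return
    return sums

# Small stand-in tensors so the sketch runs quickly; pinning requires CUDA.
host = [torch.ones(4, 3), torch.full((4, 3), 2.0)]
if torch.cuda.is_available():
    host = [t.pin_memory() for t in host]
sums = prefetched_sums(host, 4)
```

If the assumption holds, the time spent in the access step should shrink once the work between accesses is longer than one copy; if it does not, the cause lies elsewhere.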