I want to hide the time needed to transfer tensors from CPU to GPU. In general this can be done with gpu_tensor = tensor.cuda(non_blocking=True), provided the source tensor is in pinned (page-locked) host memory.
The copy only blocks once I actually access gpu_tensor. But suppose I only start doing real work after the first access to the tensor on the GPU. To still be able to hide the CPU-to-GPU transfer time, I would like to load the tensor one iteration ahead.
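To make that concrete, here is a minimal sketch of the asynchronous copy I mean (illustrative names; it assumes a CUDA device is available and that the source tensor is pinned):

```python
import torch

if torch.cuda.is_available():
    # Pinned (page-locked) host memory is required for the H2D copy
    # to actually run asynchronously with respect to the CPU.
    cpu_tensor = torch.rand(1000, 1000).pin_memory()

    # Returns immediately; the copy proceeds in the background.
    gpu_tensor = cpu_tensor.cuda(non_blocking=True)

    # The first real access forces synchronization with the copy.
    total = gpu_tensor.sum().item()
```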
Here is a little code example:
```python
import time
import torch

device = torch.device("cuda")
num_batches = 10
sleep_time = 0.1
pre_load = True
sleep_2 = True
write_back = True

to_gpu_time = tensor_access_time = 0.0
total_sleep_2_time = write_back_time = 0.0

# Two pinned CPU tensors and two GPU buffers (double buffering).
tensors = [torch.rand((20000, 800), dtype=torch.float32).pin_memory(),
           torch.rand((20000, 800), dtype=torch.float32).pin_memory()]
gpu_tensors = [torch.empty((20000, 800), dtype=torch.float32, device=device),
               torch.empty((20000, 800), dtype=torch.float32, device=device)]

if pre_load:
    # Load the first batch synchronously so iteration 0 has its data.
    gpu_tensors[0].copy_(tensors[0], non_blocking=False)

for batch_index in range(num_batches):
    to_gpu_time -= time.time()
    if pre_load:
        # Kick off the copy for the *next* iteration asynchronously.
        gpu_tensors[(batch_index + 1) % 2].copy_(
            tensors[(batch_index + 1) % 2], non_blocking=True)
    else:
        gpu_tensor = tensors[batch_index % 2].to(device, non_blocking=True)
    to_gpu_time += time.time()

    tensor_access_time -= time.time()
    if pre_load:
        gpu_tensor = gpu_tensors[batch_index % 2]
    sum_result = gpu_tensor.sum().item()  # .item() synchronizes with the GPU
    tensor_access_time += time.time()

    # do some work
    total_sleep_2_time -= time.time()
    if sleep_2:
        time.sleep(sleep_time)
    total_sleep_2_time += time.time()

    if write_back:
        write_back_time -= time.time()
        cpu_tensor = gpu_tensor.cpu()
        write_back_time += time.time()
```
In theory, the CPU-to-GPU transfer should be hidden inside the sleep that follows the tensor access, since the next tensor is pre-loaded one iteration ahead. Unfortunately, it does not seem to work that way.
Do you have an idea why I am still losing time in the tensor_access part even when I pre-load a tensor, or do you know how I could solve this problem?
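For completeness: would I need to issue the prefetch on a separate CUDA stream for it to overlap with other work, along these lines? (Just a sketch under my own assumptions, not tested code.)

```python
import torch

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    src = torch.rand(20000, 800).pin_memory()
    dst = torch.empty(20000, 800, device="cuda")

    # Issue the prefetch on a side stream so it can overlap with
    # compute running on the default stream.
    with torch.cuda.stream(copy_stream):
        dst.copy_(src, non_blocking=True)

    # ... compute on the default stream here ...

    # Before consuming `dst`, make the default stream wait for the copy.
    torch.cuda.current_stream().wait_stream(copy_stream)
    total = dst.sum().item()
```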