Hi ptrblck,
Sorry to bother you again. This time I'll try to illustrate my problem in more detail with the following example code:
import torch

if __name__ == '__main__':
    device = 'cuda:2'
    torch.cuda.set_device(device)
    stream = torch.cuda.Stream()
    tensor_a = torch.zeros((100, 10000, 10000), device=device, dtype=torch.float16)
    tensor_b = torch.ones((100, 10000, 10000), device=device, dtype=torch.float16)
    a_indices = torch.as_tensor([1, 3, 7, 9, 11, 13, 17, 19, 21, 25], device=device, dtype=torch.int)
    b_indices = torch.as_tensor([0, 1, 2, 3, 4, 5, 6, 7, 9, 11], device=device, dtype=torch.int)
    with torch.cuda.stream(stream):
        # scatter-assignment form (dispatches to index_put_):
        # tensor_a[:, a_indices, :] = tensor_b[:, b_indices, :]
        # note: advanced indexing on the left-hand side below returns a new
        # tensor, so copy_ writes into a temporary rather than into tensor_a
        tensor_a[:, a_indices, :].copy_(tensor_b[:, b_indices, :], non_blocking=True)
In the above code, I want to copy some slices of tensor_b into tensor_a. However, this copy is really time-consuming. My objective is to hide this overhead, but neither launching it on a separate stream nor passing non_blocking=True has any effect. I think the reason is that the advanced indexing reads from non-contiguous memory.
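Concretely, here is how I measure the blocking behavior (a minimal sketch with smaller shapes so it fits in memory; the exact sizes and device are just for illustration):

import time
import torch

device = 'cuda:0'
a = torch.zeros((100, 1000, 1000), device=device, dtype=torch.float16)
b = torch.ones((100, 1000, 1000), device=device, dtype=torch.float16)
a_idx = torch.as_tensor([1, 3, 7, 9], device=device)
b_idx = torch.as_tensor([0, 1, 2, 3], device=device)

torch.cuda.synchronize()
t0 = time.perf_counter()
a[:, a_idx, :] = b[:, b_idx, :]   # the indexed copy in question
t1 = time.perf_counter()          # how long until control returns to Python
torch.cuda.synchronize()
t2 = time.perf_counter()          # how long until the GPU actually finishes

print(f'host blocked for {(t1 - t0) * 1e3:.1f} ms, GPU done after {(t2 - t0) * 1e3:.1f} ms')

If the launch were truly asynchronous, I would expect the first interval to be near zero regardless of the second.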
My question is: is it possible to make this assignment asynchronous, so that it blocks neither the main process nor subsequent CUDA operations? If PyTorch cannot do this, is there any workaround with libtorch?
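For reference, this is the kind of overlap I am hoping for; a rough sketch assuming a dedicated stream, event-based ordering, and an explicit contiguous staging buffer (copy_stream, done, and staging are my own names, and I have not confirmed this actually overlaps in practice):

copy_stream = torch.cuda.Stream(device=device)
done = torch.cuda.Event()

with torch.cuda.stream(copy_stream):
    # gather the non-contiguous slices into one contiguous buffer, then
    # scatter them into tensor_a; both ops are enqueued on copy_stream
    staging = tensor_b.index_select(1, b_indices.long())
    tensor_a.index_copy_(1, a_indices.long(), staging)
    done.record(copy_stream)

# ... independent kernels could be launched on the default stream here ...

torch.cuda.current_stream().wait_event(done)  # order later work after the copy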
Looking forward to your response.
BR