Copy values between CUDA tensors in non-contiguous memory

Hi,

I will try to illustrate my problem in more detail with the example code below:

import torch


if __name__ == '__main__':
    device = 'cuda:2'
    torch.cuda.set_device(device)

    stream = torch.cuda.Stream()
    tensor_a = torch.zeros((100, 10000, 10000), device=device, dtype=torch.float16)
    tensor_b = torch.ones((100, 10000, 10000), device=device, dtype=torch.float16)

    a_indices = torch.as_tensor([1, 3, 7, 9, 11, 13, 17, 19, 21, 25], device=device, dtype=torch.int)
    b_indices = torch.as_tensor([0, 1, 2, 3, 4, 5, 6, 7, 9, 11], device=device, dtype=torch.int)
    
    with torch.cuda.stream(stream):
        # tensor_a[:, a_indices, :] = tensor_b[:, b_indices, :]
        tensor_a[:, a_indices, :].copy_(tensor_b[:, b_indices, :], non_blocking=True)

In the code above, I want to copy some data from tensor_b into tensor_a. However, this copy is time-consuming. My goal is to reduce or hide this overhead, but neither wrapping the copy in a separate stream nor passing non_blocking=True has any effect. I suspect the reason is that the indexing touches non-contiguous memory.
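To make the blocking concrete, here is a minimal timing sketch (with much smaller shapes than my real workload; measure_copy is just an illustrative helper and the numbers are only meant to separate launch time from device time):

import time
import torch

def measure_copy(device='cuda'):
    a = torch.zeros((10, 1000, 1000), device=device, dtype=torch.float16)
    b = torch.ones((10, 1000, 1000), device=device, dtype=torch.float16)
    a_idx = torch.as_tensor([1, 3, 7, 9], device=device, dtype=torch.long)
    b_idx = torch.as_tensor([0, 1, 2, 3], device=device, dtype=torch.long)

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    a[:, a_idx, :] = b[:, b_idx, :]   # advanced-index assignment
    t1 = time.perf_counter()          # time until the host gets control back
    torch.cuda.synchronize()
    t2 = time.perf_counter()          # time until the device has finished
    print(f'host launch: {(t1 - t0) * 1e3:.3f} ms, device total: {(t2 - t0) * 1e3:.3f} ms')

If the launch itself is fast and only the synchronize is slow, the op is already asynchronous on the host and the remaining cost is simply the kernel time on the device.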

My question is: is it possible to make this assignment asynchronous, so that it blocks neither the main process nor subsequent CUDA operations? If PyTorch cannot do this, is there any workaround with libtorch?
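For reference, I could also express the copy with index_select / index_copy_ instead of advanced indexing. I am not sure whether this changes the blocking behaviour, but it makes the intermediate buffer explicit. A sketch reusing the tensors and stream from the snippet above (indices converted to int64, which index_copy_ expects):

    a_idx64 = a_indices.long()
    b_idx64 = b_indices.long()

    with torch.cuda.stream(stream):
        # gather the selected slices of tensor_b along dim 1 ...
        selected = tensor_b.index_select(1, b_idx64)
        # ... then scatter them into the selected slices of tensor_a
        tensor_a.index_copy_(1, a_idx64, selected)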

Looking forward to your response.

BR

Did you profile your code to verify the operation is not executed asynchronously? If so, could you show the nsys timeline?
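For example, something like this would make the copy easy to find in the timeline (just a sketch; your_script.py is a placeholder for your file):

# in your script, around the copy
torch.cuda.nvtx.range_push('indexed_copy')
tensor_a[:, a_indices, :].copy_(tensor_b[:, b_indices, :], non_blocking=True)
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()

# on the command line
nsys profile -o indexed_copy_report python your_script.py

The NVTX range will show up in the nsys timeline, so you can see whether the host is blocked during the launch or whether the time is spent in the copy kernels on the device.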