Is it possible to speed up the value assignment process in a CUDA tensor?

Hi,

I have a question about the value assignment on PyTorch.
For example, I want to copy values from a cuda tensor ‘a’ to a cuda tensor ‘b’, from a series of locations. A simple solution is:

b[loc_list] = a[loc_list]

where ‘loc_list’ is a list of locations, e.g., loc_list = [1, 3, 5, 7, 9]. However, this assignment seems inefficient, since its overhead grows with the length of loc_list.

My question is: is it possible to parallelize this assignment? Any method is fine, including a C++ implementation.
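For reference, the same assignment can also be written with the dedicated index kernels; a minimal sketch (the shapes and index values here are arbitrary):

import torch

a = torch.randn(10, 5, device='cuda')
b = torch.zeros(10, 5, device='cuda')
loc = torch.tensor([1, 3, 5, 7, 9], device='cuda')

# gather the selected rows of 'a', then scatter them into 'b' in place;
# each op runs as a single CUDA kernel over all indexed rows
b.index_copy_(0, loc, a.index_select(0, loc))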

Thanks.

If you transform loc_list to a tensor and move it to the GPU, you’ll remove the H2D copy (assuming you are reusing the indices a few times and can thus benefit from the reduction of copies).
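Something like this (a minimal sketch; the names, shapes, and index values are just for illustration):

import torch

device = 'cuda'
a = torch.randn(100, 10, device=device)
b = torch.zeros(100, 10, device=device)

# build the index tensor once on the GPU and reuse it, so the Python list
# is not copied host-to-device on every assignment
loc = torch.tensor([1, 3, 5, 7, 9], device=device)

b[loc] = a[loc]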

Hi ptrblck,

Thanks for your response. I would like to know: after I transform ‘loc_list’ into a tensor, can PyTorch parallelize the assignment across the multiple rows of ‘b’?
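A minimal sketch of how one could check which CUDA kernels the assignment launches (the shapes here are arbitrary):

import torch

a = torch.randn(1000, 1000, device='cuda')
b = torch.zeros(1000, 1000, device='cuda')
idx = torch.arange(0, 1000, 2, device='cuda')

# profile the indexed assignment to see the kernels it launches
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    b[idx] = a[idx]
print(prof.key_averages().table(sort_by='cuda_time_total'))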

Hi ptrblck,

Sorry to bother you again. This time I will try to illustrate my problem in more detail with the example code below:

import torch

if __name__ == '__main__':
    device = 'cuda:2'
    torch.cuda.set_device(device)

    stream = torch.cuda.Stream()
    tensor_a = torch.zeros((100, 10000, 10000), device=device, dtype=torch.float16)
    tensor_b = torch.ones((100, 10000, 10000), device=device, dtype=torch.float16)

    a_indices = torch.as_tensor([1, 3, 7, 9, 11, 13, 17, 19, 21, 25], device=device, dtype=torch.int)
    b_indices = torch.as_tensor([0, 1, 2, 3, 4, 5, 6, 7, 9, 11], device=device, dtype=torch.int)

    with torch.cuda.stream(stream):
        # tensor_a[:, a_indices, :] = tensor_b[:, b_indices, :]
        # note: the advanced-indexed expression below materializes a temporary,
        # so copy_ writes into that temporary rather than into tensor_a;
        # the commented-out assignment above is the actual in-place form
        tensor_a[:, a_indices, :].copy_(tensor_b[:, b_indices, :], non_blocking=True)

In the above code, I want to copy some data from tensor_b to tensor_a. However, this process is really time-consuming. My objective is to hide this overhead, but neither wrapping the copy in a separate stream nor using the non-blocking flag takes effect. I think the reason is that I am reading from non-contiguous memory.

My question is: is it possible to make this assignment asynchronous, so that it blocks neither the main process nor subsequent CUDA operations? If PyTorch cannot do this, is there any workaround with libtorch?
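For completeness, here is a sketch of the kind of overlap I am hoping for, using a side stream plus an event (the names, shapes, and indices are just for illustration, and I am not sure this is the right approach):

import torch

device = 'cuda'
copy_stream = torch.cuda.Stream(device=device)
done = torch.cuda.Event()

a = torch.zeros(8, 1024, 1024, device=device, dtype=torch.float16)
b = torch.ones(8, 1024, 1024, device=device, dtype=torch.float16)
a_idx = torch.tensor([1, 3, 5], device=device)
b_idx = torch.tensor([0, 2, 4], device=device)

# make the side stream wait for any pending work that produced 'b'
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    a[:, a_idx, :] = b[:, b_idx, :]  # gather + scatter kernels on copy_stream
    done.record()

# independent kernels launched on the default stream here can overlap with
# the copy; block the default stream only once the result is actually needed
torch.cuda.current_stream().wait_event(done)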

Looking forward to your response.

BR