Slice copy from GPU to CPU tensors

I’m still new to CUDA programming and thought I could write a specific kernel for my application. I have a tensor src from which I want to read some of the elements and copy those values to their corresponding locations in a tensor tgt. Both tensors are 1-d; however, src is in GPU memory and tgt is in CPU memory. I also have access to both the bytemask (torch.BoolTensor) and the index locations of the elements I want to write to.

I tried using tgt[idxs].copy_(src[idxs]); however, this fails silently: the data isn’t copied into tgt but into the new storage that the advanced indexing tgt[idxs] creates. tgt[idxs] = src[idxs] fails outright because the tensors are on different devices.
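A minimal CPU-only repro of the silent failure (no GPU involved; this just shows that advanced indexing materialises a copy):

```python
import torch

tgt = torch.zeros(6)
src = torch.arange(6, dtype=torch.float32)
idxs = torch.tensor([1, 3])

# tgt[idxs] is advanced indexing, so it materialises a new tensor;
# copy_ writes into that temporary and tgt is left untouched.
tgt[idxs].copy_(src[idxs])
print(tgt)  # still all zeros

# Indexed assignment goes through index_put_ and does modify tgt in place.
tgt[idxs] = src[idxs]
print(tgt)  # tensor([0., 1., 0., 3., 0., 0.])
```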

I have tried writing an at::parallel_for implementation of this, but it is much slower than I would like, so I am looking to write a GPU kernel instead. I know I can obtain a device-side pointer to host memory with cudaHostGetDevicePointer; however, if I am reading the documentation correctly, that only works for host memory that was allocated (or registered) as mapped pinned memory, whereas I already have the host buffer in tgt. Can anyone help me with this?
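To illustrate the direction I mean at the PyTorch level (a sketch, not something I have working: it assumes a CUDA device, and pin_memory=True allocates a new pinned host buffer, which is exactly what I would like to avoid for my existing tgt):

```python
import torch

if torch.cuda.is_available():
    src = torch.arange(8, dtype=torch.float32, device="cuda")

    # A pinned (page-locked) host tensor can be the target of an
    # asynchronous device-to-host copy...
    tgt = torch.empty(8, pin_memory=True)

    # ...but this only covers a whole-tensor copy_, not a masked subset,
    # and it required allocating tgt as pinned in the first place.
    tgt.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
```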

I have already tried the following kernel which fails due to the tgt buffer being on CPU memory:

template<typename scalar_t>
__global__
void
cuda_mask_copy_kernel(scalar_t*         __restrict__ tgt_1d,
                      const bool*       __restrict__ mask_1d,
                      const scalar_t*   __restrict__ src_1d,
                      size_t N)
{
    // One thread per element; using size_t avoids a signed/unsigned
    // comparison with N and overflow for very large tensors.
    const size_t index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index < N && mask_1d[index])
    {
        tgt_1d[index] = src_1d[index];
    }
}

For completeness, here is my at::parallel_for implementation, but as I said, it is much slower than I would wish:

torch::Tensor&
idx_copy(torch::Tensor& tgt,
         const torch::Tensor& idxs,
         const torch::Tensor& src)
{
    TORCH_CHECK(idxs.sizes().size() == 1);
    CHECK_TYPE(idxs, torch::kLong);
    CHECK_SAME_SIZE(tgt, src);
    TORCH_CHECK(tgt.scalar_type() == src.scalar_type());

    torch::Tensor tgt_1d = tgt.view({-1});
    const torch::Tensor idxs_1d = idxs.view({-1});
    const torch::Tensor src_1d = src.view({-1});

    at::parallel_for(0, idxs_1d.size(0), at::internal::GRAIN_SIZE, [&](int64_t start, int64_t end)
        {
            for (int64_t idx = start; idx < end; ++idx)
            {
                // Each Tensor operator[] here dispatches a full aten op per
                // element, which I suspect is why this is so slow.
                tgt_1d[idxs_1d[idx]] = src_1d[idxs_1d[idx]];
            }
        }
    );

    return tgt;
}

I think the title is slightly inaccurate, but the analysis is spot on. The problem is that the indexing you do (with a list of indices or a mask) necessarily creates a new tensor in new memory. If you were slicing instead, the indexing would create a view and copy_ would work.

copy_ is the “only way” to transfer from the GPU to the CPU (and .to uses it under the hood), so there is not much you can do besides doing the transfer and then working on the CPU.

An alternative to tgt[idxs] = src[idxs] could be to use where with your boolean mask: tgt = torch.where(mask, src, tgt). This creates a new tensor tgt but does not materialise src[idxs]. It is also relatively friendly to optimizers like nvFuser when it is part of a longer computation. Personally, I find where a good fit for how I think about these operations, but of course, this may or may not apply to your case.
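Concretely, a runnable sketch with everything on one device (CPU here; mask stands in for the boolean bytemask you mentioned):

```python
import torch

src = torch.tensor([10., 20., 30., 40.])
tgt = torch.tensor([1., 2., 3., 4.])
mask = torch.tensor([True, False, True, False])

# Elementwise select: take src where the mask is set, keep tgt elsewhere.
# No intermediate src[mask] tensor is materialised.
out = torch.where(mask, src, tgt)
print(out)  # tensor([10.,  2., 30.,  4.])
```

Note that torch.where needs all three operands on the same device, so in your setup the selection would have to happen on one side before any transfer.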

Best regards

Thomas

Hey Tom,

Thanks for the response, and I am sorry for the inaccuracy in the title. I would love to use torch.where; however, I believe it would allocate more GPU memory, and I am already short on it. I also really want to avoid copying the GPU tensor to CPU memory and applying an indexed assignment (tgt[idxs] = src[idxs].cpu()): that would have to be an absolute worst-case scenario for me.