I’m still new to CUDA programming and thought I could write a specific kernel for my application. I have a tensor src
which I want to write some of the elements from and copy those values to their corresponding locations in a tensor tgt
. Both tensors are 1d, however src
is on a gpu and tgt
is on CPU memory. I also have access to both the bytemask (torch.BoolTensor
) as well as index locations of the elements I want to write to.
I tried using tgt[idxs].copy_(src[idx])
, however this fails silently (data isn’t copied to tgt
but instead is copied to the new storage of tgt[idxs]
), and tgt[idxs] = src[idxs]
fails because of different devices.
I have tried writing an at::parallel_for
operation for this but it is much slower than I would want, so I am looking to try and write a GPU kernel for this. I know I can access a device-side host pointer with cudaHostGetDevicePointer
, however if I’m reading the documentation correctly, this allocates a new host side buffer whereas I already have the host buffer in tgt
. Can anyone help me with this?
I have already tried the following kernel which fails due to the tgt
buffer being on CPU memory:
template<typename scalar_t>
__global__
void
cuda_mask_copy_kernel(scalar_t* __restrict__ tgt_1d,
const bool* __restrict__ mask_1d,
const scalar_t* __restrict__ src_1d,
size_t N)
{
const int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < N && mask_1d[index])
{
tgt_1d[index] = src_1d[index];
}
}
For completeness, here is my at::parallel_for
implementation, but as I said, it is much slower than I would wish:
torch::Tensor&
idx_copy(torch::Tensor& tgt,
const torch::Tensor& idxs,
const torch::Tensor& src)
{
TORCH_CHECK(idxs.sizes().size() == 1);
CHECK_TYPE(idxs, torch::kLong);
CHECK_SAME_SIZE(tgt, src);
TORCH_CHECK(tgt.scalar_type() == src.scalar_type());
torch::Tensor tgt_1d = tgt.view({-1});
const torch::Tensor idxs_1d = idxs.view({-1});
const torch::Tensor src_1d = src.view({-1});
at::parallel_for(0, idxs_1d.size(0), at::internal::GRAIN_SIZE, [&](int64_t start, int64_t end)
{
for(int64_t idx = start; idx < end; ++idx)
{
tgt_1d[idxs_1d[idx]] = src_1d[idxs_1d[idx]];
}
}
);
return tgt;
}