Reversed index mapping in C++/CUDA extension?

I’m trying to implement a 3D resampling kernel with a sparse feature map input. The basic idea is shown in my PyTorch script:

def forward(feat, loc, new_loc, kernel): # forward function of the 3D resampling layer
    '''feat:     float tensor of size (N, C). N is the number of non-empty locations in the 3D space.
       loc:      int64 tensor of size (N,), which keeps track of the absolute locations in the 3D space. It is int64-encoded because we use fixed grid sampling for the input feature map.
       new_loc:  float tensor of size (M,), which gives the new (continuous) locations to be interpolated.
       kernel:   float tensor of size (K,), the 3D resampling kernel.
    '''
    ...
    # For each new location, find the absolute locations of the input features within the kernel.
    loc_in_kernel = find_neighboring(new_loc, kernel.shape[0]) # int64, size (M, K)

    # Find the reversed mapping, such that loc[reversed_idx] = loc_in_kernel.
    reversed_idx = find_a_in_b(a=loc_in_kernel, b=loc) # int64, size (M, K)

    # Apply the parallel resample kernel in 3D space. (No memory issue here.)
    new_feat = resample3d(feat, kernel, reversed_idx) # size (M, C)
    return new_feat

In practice, K can be really large, e.g. 10x10x10 = 1000, which makes the (M, K) tensors easily run out of memory. One solution is to implement the find_neighboring() and find_a_in_b() functions in CUDA and fuse them with the resample3d() function, so the intermediates never materialize. The parallelism would be applied over the M dimension.
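Before writing custom CUDA, a simpler workaround for the memory blow-up is to process the M new locations in chunks, so only a (chunk, K) intermediate exists at a time. A rough sketch (plain Python/NumPy; the helper functions are passed in and stand for the ones in the script above):

```python
import numpy as np

def resample_chunked(feat, loc, new_loc, kernel,
                     find_neighboring, find_a_in_b, resample3d,
                     chunk=4096):
    """Process new_loc in slices of `chunk` rows so the temporary
    (chunk, K) index tensors stay bounded regardless of M."""
    out = []
    for start in range(0, len(new_loc), chunk):
        nl = new_loc[start:start + chunk]
        lik = find_neighboring(nl, kernel.shape[0])   # (chunk, K) neighbor locations
        ridx = find_a_in_b(lik, loc)                  # (chunk, K) reversed indices
        out.append(resample3d(feat, kernel, ridx))    # (chunk, C) resampled features
    return np.concatenate(out, axis=0)                # (M, C)
```

This trades one Python-level loop over chunks for an O(M/chunk · K) memory bound, and works identically with PyTorch tensors (replace `np.concatenate` with `torch.cat`).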

However, in C++/CUDA, I cannot figure out a way to do the reversed index mapping of find_a_in_b(). I'm also not sure whether using for-loops to find the reversed indices would be very slow in this case, since N can also be in the millions. Can anyone help?
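For reference, the reverse lookup does not need a linear scan over N per query: sorting `loc` once and answering each query with a binary search gives O(log N) per lookup, which is the same pattern a CUDA kernel (or Thrust's `lower_bound`) would use. A NumPy sketch of a hypothetical `find_a_in_b` along these lines, with `-1` marking locations absent from `loc`:

```python
import numpy as np

def find_a_in_b(a, b, miss=-1):
    """For every entry of `a`, return an index i with b[i] == entry,
    or `miss` if the entry does not occur in `b`.
    `b` is sorted once (O(N log N)); each query is then a binary
    search (O(log N)) instead of a linear scan (O(N))."""
    order = np.argsort(b)                         # permutation that sorts b
    b_sorted = b[order]
    pos = np.searchsorted(b_sorted, a)            # candidate slot per query
    pos = np.clip(pos, 0, len(b_sorted) - 1)      # guard queries beyond the last key
    hit = b_sorted[pos] == a                      # True where the location exists in b
    return np.where(hit, order[pos], miss)

loc = np.array([7, 2, 9, 4], dtype=np.int64)                # unsorted absolute locations
loc_in_kernel = np.array([[2, 9], [4, 5]], dtype=np.int64)  # (M, K) neighbor locations
idx = find_a_in_b(loc_in_kernel, loc)
# wherever idx != -1, loc[idx] == loc_in_kernel holds
```

Inside a fused CUDA kernel, each of the M threads would run the same binary search over the sorted `loc` for its K neighbors and consume the result immediately, so the (M, K) index tensor never needs to be materialized.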

Not a complete answer, but you should probably take a look at the Thrust library; it offers lots of building blocks for such operations on the GPU.