I’m trying to implement a 3D resampling kernel with a sparse feature map input. The basic idea is shown in my PyTorch script:

```python
def forward(feat, loc, new_loc, kernel):
    '''Forward pass of the 3D resampling layer.
    feat: float tensor of size (N, C). N is the number of non-empty locations in the 3D space.
    loc: int64 tensor of size (N,), the absolute locations in the 3D space. They are int64-encoded
         because the input feature map uses fixed grid sampling.
    new_loc: float tensor of size (M,), the new (continuous) locations to be interpolated.
    kernel: float tensor of size (K,), the 3D resampling kernel.
    '''
    ...
    # For each new location, find the absolute locations of the input features covered by the kernel.
    loc_in_kernel = find_neighboring(new_loc, kernel.shape[0])  # int64, size (M, K)
    # Find the reverse mapping, such that loc[reversed_idx] == loc_in_kernel.
    reversed_idx = find_a_in_b(a=loc_in_kernel, b=loc)  # int64, size (M, K)
    # Apply the resampling kernel in 3D space in parallel. (No memory issue at this step.)
    new_feat = resample3d(feat, kernel, reversed_idx)  # size (M, C)
    return new_feat
```
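To pin down the semantics of `find_a_in_b()`: for every query location it should return the row index into `loc` holding that location, or -1 if the location is empty. A minimal dense reference (the dict-based helper below is just my illustration, not a real implementation):

```python
import torch

def find_a_in_b_reference(a, b):
    """a: (M, K) int64 query locations; b: (N,) int64 occupied locations.
    Returns idx of shape (M, K) with b[idx] == a where a occurs in b, else -1."""
    # Build a hash map from location value to its row index in b.
    lookup = {int(v): i for i, v in enumerate(b.tolist())}
    flat = a.reshape(-1).tolist()
    idx = torch.tensor([lookup.get(v, -1) for v in flat], dtype=torch.int64)
    return idx.reshape(a.shape)
```

This is O(N) to build the map and O(1) average per query, but it runs on the CPU and still materializes the full (M, K) result.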

In practice, K can be really large, e.g. 10×10×10 = 1000, so the (M, K) tensors easily run out of memory. One solution is to implement `find_neighboring()` and `find_a_in_b()` in CUDA and fuse them with the `resample3d()` function, parallelizing over the M dimension.
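As a PyTorch-side fallback, I've been experimenting with sorting `loc` once and binary-searching into it with `torch.searchsorted`; this also maps naturally onto CUDA (one thread per (m, k) entry doing a binary search over the sorted array). A sketch (helper name is mine):

```python
import torch

def find_a_in_b_sorted(a, b):
    """Same contract as find_a_in_b: b[idx] == a where present, else -1.
    One O(N log N) sort, then O(M * K * log N) vectorized binary search."""
    b_sorted, order = torch.sort(b)                 # order maps sorted positions back to rows of b
    pos = torch.searchsorted(b_sorted, a)           # (M, K) candidate positions in b_sorted
    pos = pos.clamp(max=b.numel() - 1)              # guard queries beyond the largest value
    hit = b_sorted[pos] == a                        # exact match test
    return torch.where(hit, order[pos], torch.full_like(pos, -1))
```

This avoids any Python loop, but it still allocates the full (M, K) index tensor, so it only fixes speed, not memory.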

However, in C++/CUDA I cannot figure out how to do the reverse index mapping that `find_a_in_b()` performs. I'm also not sure whether using for-loops to find the reverse indices would be very slow in this case, since N can also be in the millions. Can anyone help?
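For completeness, my current stopgap bounds peak memory by chunking over M, so only a (chunk, K) slab of indices is live at a time; again just a sketch with an invented helper name, combining the sort-plus-binary-search idea above:

```python
import torch

def find_a_in_b_chunked(a, b, chunk=65536):
    """Chunked variant: same output as find_a_in_b, but only (chunk, K) is resident at once."""
    b_sorted, order = torch.sort(b)                 # sort b once, reuse for every chunk
    out = torch.empty(a.shape, dtype=torch.int64)
    for s in range(0, a.shape[0], chunk):
        q = a[s:s + chunk]                          # (chunk, K) slab of queries
        pos = torch.searchsorted(b_sorted, q).clamp(max=b.numel() - 1)
        hit = b_sorted[pos] == q
        out[s:s + chunk] = torch.where(hit, order[pos], torch.full_like(pos, -1))
    return out
```

In the full pipeline one would presumably also call `resample3d()` per chunk and concatenate the (chunk, C) outputs, so the (M, K) tensors never exist in full; but a fused CUDA kernel would clearly be cleaner, which is why I'm asking.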