C++/CUDA custom function: RuntimeError: CUDA error: invalid device function

This could mean that the built binary itself should be correct.

Yes, assuming you are using the latest PyTorch release.

Your setup seems to be the issue: you are mixing the CUDA runtime shipped in the PyTorch binaries (11.1) with the local CUDA toolkit used to build the extension (10.0). You would need to use the same CUDA version for both. I’ve rebuilt the extension on a server with a P100 using matching CUDA versions, and this code snippet works fine:

import torch  # import torch before loading the compiled extension

from PAM_cuda.pl import PermutohedralLattice

if __name__ == '__main__':
    import numpy as np
    pl = PermutohedralLattice.apply

    im = torch.randn(24, 24, 3)
    indices = np.reshape(np.indices(im.shape[:2]), (2, -1))[None, :]
    im = im.permute(2, 0, 1)
    rgb = im.reshape(3, -1).unsqueeze(0)
    out = pl(torch.from_numpy(indices / 5.0).cuda().float(),
             (rgb / 0.125).cuda().float())

    output = out.squeeze().cpu().numpy()
    output = np.transpose(output, (1, 0))
    output = np.reshape(output, (im.shape[1], im.shape[2], 3))
    print(output)
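
To check for the kind of version mismatch described above before rebuilding, something like the following rough sketch could help. It compares `torch.version.cuda` (the CUDA version the PyTorch binaries were built with) against the release reported by the local `nvcc --version`. The helper names `parse_nvcc_release`, `local_toolkit_version`, and `versions_match` are made up for this illustration and are not part of PyTorch; the sketch also assumes that the `nvcc` found on your `PATH` is the one your extension build actually uses, which may not be true if `CUDA_HOME` points elsewhere.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' release string from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def local_toolkit_version():
    """Return the CUDA release of the nvcc on PATH, or None if nvcc is not found."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def versions_match(runtime_version, toolkit_version):
    """True only if both versions are known and identical (e.g. '11.1' == '11.1')."""
    return runtime_version is not None and runtime_version == toolkit_version


if __name__ == "__main__":
    try:
        import torch
        runtime = torch.version.cuda  # None for CPU-only PyTorch builds
    except ImportError:
        runtime = None  # PyTorch is not installed in this environment

    toolkit = local_toolkit_version()
    print("CUDA version used by PyTorch binaries:", runtime)
    print("Local CUDA toolkit (nvcc) version:", toolkit)
    if versions_match(runtime, toolkit):
        print("Versions match.")
    else:
        print("Versions differ or could not be determined; "
              "rebuild the extension with a matching toolkit.")
```

With the versions from this thread, `versions_match("11.1", "10.0")` returns `False`, which is the situation that needs to be fixed by building the extension with the same CUDA version as the PyTorch binaries.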