Error: RuntimeError: CUDA error: device-side assert triggered

After several epochs, I encounter "RuntimeError: CUDA error: device-side assert triggered" in the interpolation function. I already checked whether there is an error in the indexing, but everything seems all right. I would appreciate any useful hints.
Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] pytorch-lightning==1.5.2
[pip3] torch==1.10.1
[pip3] torch-cluster==1.5.9
[pip3] torch-geometric==2.0.3
[pip3] torch-points-kernels==0.7.0
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.12
[pip3] torch-spline-conv==1.2.1
[pip3] torchaudio==0.10.1
[pip3] torchmetrics==0.6.0
[pip3] torchvision==0.11.2

def interpolation(xyz, new_xyz, feat, offset, new_offset, k=3):
    """
    input: xyz: (m, 3), new_xyz: (n, 3), feat: (m, c), offset: (b), new_offset: (b)
    output: (n, c)
    """
    assert xyz.is_contiguous() and new_xyz.is_contiguous() and feat.is_contiguous()
    idx, dist = knnquery(k, xyz, new_xyz, offset, new_offset)  # (n, 3), (n, 3)
    dist_recip = 1.0 / (dist + 1e-8)  # (n, 3)
    norm = torch.sum(dist_recip, dim=1, keepdim=True)
    weight = dist_recip / norm  # (n, 3)

    new_feat = torch.cuda.FloatTensor(new_xyz.shape[0], feat.shape[1]).zero_()
    for i in range(k):
        new_feat += feat[idx[:, i].long(), :] * weight[:, i].unsqueeze(-1)
        # print(feat.size(), idx[:, i])
    return new_feat

I would guess that you are running into an indexing issue.
Rerun the code with CUDA_LAUNCH_BLOCKING=1 python args, or on the CPU, to get a better stack trace.
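
If an out-of-range index is the suspect, one option is to validate idx explicitly before it is used to index feat, so the failure shows up as a readable Python exception instead of a device-side assert. The following is only a minimal sketch: check_index is a hypothetical helper, not part of PyTorch or of the code posted above, and it assumes that negative indices should be treated as errors here (plain PyTorch indexing would silently wrap them around).

```python
import torch


def check_index(idx: torch.Tensor, size: int, name: str = "idx") -> None:
    """Raise a readable IndexError if any entry of idx lies outside [0, size).

    Hypothetical debugging helper. Negative entries are rejected as well,
    on the assumption that a neighbor index should never be negative.
    """
    if idx.numel() == 0:
        return  # nothing to check
    lo, hi = int(idx.min()), int(idx.max())  # int() forces a sync on GPU tensors
    if lo < 0 or hi >= size:
        raise IndexError(
            f"{name} out of range: min={lo}, max={hi}, "
            f"valid range is [0, {size - 1}]"
        )
```

In the interpolation function this could be called as check_index(idx, feat.shape[0]) right after knnquery returns. The min/max reductions force a host synchronization, so it is probably best kept as a temporary debugging aid.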

If I rerun the code on the CPU only, I immediately get the error "CUDA error: device-side assert triggered" at the line torch.sqrt(dist2). There are no NaN values or values smaller than 0 in dist2. The error could come from the function itself, since knnquery is a C++ and CUDA implementation.

Rerunning the code on the CPU shouldn't raise a CUDA error if the GPU is not used at all.
I don't know if you are using a Jupyter notebook or another REPL environment, but make sure to restart the environment once you run into a CUDA assert, as these errors might be sticky: after an assert is triggered, later CUDA calls in the same process can keep reporting the same error.


knnquery always runs on the GPU, since it is a compiled GPU library. I will reimplement the functions in pure PyTorch and check whether the error still appears. I am not using any REPL environment. Thank you for the recommendations.
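
A pure-PyTorch replacement for the neighbor search could look roughly like the sketch below. It is not the compiled knnquery and makes several assumptions that the thread does not confirm: offset and new_offset are taken to hold the cumulative end index of each batch element, the returned indices are global row indices into xyz, and dist holds Euclidean distances. The name knn_query_torch is hypothetical. The sketch also raises explicitly when a batch element has fewer than k reference points, a case worth ruling out when chasing invalid indices.

```python
import torch


def knn_query_torch(k, xyz, new_xyz, offset, new_offset):
    """Brute-force k-nearest-neighbor search per batch element (sketch).

    xyz: (m, 3) reference points, new_xyz: (n, 3) query points.
    offset / new_offset: (b,) cumulative end indices per batch (assumed).
    Returns idx (n, k) as global indices into xyz, and dist (n, k).
    """
    idx_chunks, dist_chunks = [], []
    start, new_start = 0, 0
    for end, new_end in zip(offset.tolist(), new_offset.tolist()):
        ref = xyz[start:end]                # (m_b, 3) points of this batch
        query = new_xyz[new_start:new_end]  # (n_b, 3) queries of this batch
        if ref.shape[0] < k:
            raise ValueError(
                f"batch has only {ref.shape[0]} reference points, need k={k}"
            )
        pairwise = torch.cdist(query, ref)  # (n_b, m_b) Euclidean distances
        dist, idx = torch.topk(pairwise, k, dim=1, largest=False)
        idx_chunks.append(idx + start)      # shift local to global indices
        dist_chunks.append(dist)
        start, new_start = end, new_end
    return torch.cat(idx_chunks), torch.cat(dist_chunks)
```

Because it runs on whatever device the inputs live on, the same function can be executed on the CPU, where an invalid index raises an ordinary Python error. The full pairwise distance matrix costs O(n_b * m_b) memory per batch element, so this is intended for debugging rather than as a drop-in for large point clouds.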