Getting weird NaN value out of dot product on CUDA side

This is not directly a PyTorch question, but I’m going to post it here in hopes that someone smarter than I might be able to help anyway.

I am working on a CUDA kernel that calls some simple helper methods for linear algebra. In particular, one helper computes the dot product, and it always returns NaN. My question is: why?

The Kernel:

__global__ void find_intersection_kernel(
    int n_rays, int n_faces, 
    at::PackedTensorAccessor32<float_t, 2> vertex_acc,
    at::PackedTensorAccessor32<int32_t, 2> face_acc, 
    at::PackedTensorAccessor32<float_t, 2> origin_acc,
    at::PackedTensorAccessor32<float_t, 2> direct_acc,
    at::PackedTensorAccessor32<float_t, 2> results) {

    int ray_ix = blockIdx.x * blockDim.x + threadIdx.x;
    int face_ix = blockIdx.y * blockDim.y + threadIdx.y;

    if (ray_ix < n_rays) {
        if (face_ix < n_faces) {
            // vectors: ray origin and direction
            float_t *ray_direction_ptr = &direct_acc[ray_ix][0];
            float_t *ray_origin_ptr = &origin_acc[ray_ix][0];

            // vectors: vertices of current face
            float_t *v0_ptr = &vertex_acc[face_acc[face_ix][0]][0];
            float_t *v1_ptr = &vertex_acc[face_acc[face_ix][1]][0];
            float_t *v2_ptr = &vertex_acc[face_acc[face_ix][2]][0];
            // get edge vectors of current face
            float_t edge1[3];
            float_t edge2[3];
            sub3d(v1_ptr, v0_ptr, edge1);
            sub3d(v2_ptr, v0_ptr, edge2);

            // determinant of matrix A for eqn Ax = b
            float_t h[3];
            cross3d(ray_direction_ptr, edge2, h);

            printf("(%d, %d) - value of e1 vector: [%f, %f, %f]\n", ray_ix, face_ix, edge1[0], edge1[1], edge1[2]);
            printf("(%d, %d) - value of h vector: [%f, %f, %f]\n", ray_ix, face_ix,  h[0], h[1], h[2]);

            float_t determinant = dot3d(edge1, h);
            results[ray_ix][face_ix] = 0;
            // ... (rest of the kernel snipped)
        }
    }
}

The troublesome helper:

__device__ float_t dot3d(float_t v1[3], float_t v2[3]) {

    printf("(%d, %d) - v1: [%f, %f, %f], v2: [%f, %f, %f]\n", threadIdx.x, threadIdx.y, v1[0], v1[1], v1[2], v2[0], v2[1], v2[2]);
    float_t result = 0;
    for (int r = 0; r < 3; r++){
        float_t prod = v1[r] * v2[r];
        printf("(%d, %d, %d) - product: %f\n", r, threadIdx.x, threadIdx.y, prod);
        result = result + prod;
    }
    printf("(%d, %d) - result: %f\n", threadIdx.x, threadIdx.y, result);
    return result;
}

What I’ve tried on my own:

  • Using printf inside the helper method, I have verified that, for all threads and for all values of r, v1[r] and v2[r] print the expected values.

  • I have also verified that the intermediate product in each iteration of the dot product calculation is correct, except for the last one (r = 2), which is NaN even though both of its operands print as zero.

  • I have verified that the arguments are correct both inside and outside the scope of the function.

  • I have also worked out the math on paper and verified that the determinant is 4.

Sample printf output:
(0, 0) - value of e1 vector: [2.000000, 2.000000, 0.000000]
(1, 0) - value of e1 vector: [2.000000, 2.000000, 0.000000]
(0, 0) - value of h vector: [2.000000, 0.000000, 0.000000]
(1, 0) - value of h vector: [2.000000, 0.000000, 0.000000]
(0, 0) - v1: [2.000000, 2.000000, 0.000000], v2: [2.000000, 0.000000, 0.000000]
(1, 0) - v1: [2.000000, 2.000000, 0.000000], v2: [2.000000, 0.000000, 0.000000]
(0, 0, 0) - product: 4.000000
(0, 1, 0) - product: 4.000000
(1, 0, 0) - product: 0.000000
(1, 1, 0) - product: 0.000000
(2, 0, 0) - product: nan
(2, 1, 0) - product: nan
(0, 0) - result: nan
(1, 0) - result: nan


Why are the first two products correct but the third always NaN? How do I troubleshoot issues like this in the future?

To rule out other potential issues, are you able to reproduce these results when passing in fixed values for the vectors rather than relying on results from other helper functions?

It also doesn’t look like there is an obvious issue with the memory accesses, but I would check whether e.g. compute-sanitizer --tool=memcheck reports any errors when running your kernel. If the kernel is run with tensors allocated by PyTorch, you could use e.g. PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool=memcheck python to disable the caching allocator during the check.

Okay, so the issue does not persist when I hard-code the inputs to the values I know the previous helpers produce:

float_t dummy1[3] = {2.000000, 2.000000, 0.00000};
float_t dummy2[3] = {2.000000, 0.000000, 0.00000};
float_t determinant = dot3d(dummy1, dummy2);

In this case I get 4, which is the correct answer.

I am re-building and trying your diagnostic tool now.

Is compute-sanitizer part of the CUDA toolkit, and do I need to add it to my PATH variable? Bash isn’t finding it right off the bat.

It should be part of the CUDA toolkit: Compute Sanitizer User Manual :: Compute Sanitizer Documentation
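For the PATH question: the binary typically ships under the toolkit's bin directory. The path below is the common default install prefix, not something confirmed for this machine; adjust it to your install. A quick check:

```shell
# Common default prefix; versioned installs live under e.g. /usr/local/cuda-12.x
export PATH=/usr/local/cuda/bin:$PATH
# Print where the binary resolves, or a note if it still isn't found
command -v compute-sanitizer || echo "compute-sanitizer not on PATH"
```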