Error when compiling custom CUDA kernel

I’m trying to write my own CUDA kernel, but I’m not able to compile it as a PyTorch C++ extension. I am following the guide here, and I was able to get a C++ version working fine. Full code and error messages below:

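__global__ void linear_forward_cuda( float * input, float * weight, float * bias, float * output){
    const int tid = threadIdx.x;  // thread index within the block (one input feature each)
    const int bid = blockIdx.x;   // one block per output element
    const int m = blockDim.x;     // threads per block; must be a power of two, at most 1024

    __shared__ float shared_sum[1024];
    shared_sum[tid] = weight[(bid*m) + tid] * input[tid];
    __syncthreads();  // every product must be written before the reduction reads it

    // tree reduction in shared memory
    for (unsigned int i = m/2; i > 0 ; i = i>>1){
        if (tid < i){
            shared_sum[tid] += shared_sum[tid+i];
        }
        __syncthreads();
    }

    if (tid==0) {
        output[bid] = shared_sum[0] + bias[bid];
    }
}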

In function ‘at::Tensor linear_forward(at::Tensor, at::Tensor, at::Tensor)’:
/home/karthik/projects/torch_files/approx-mult/cuda/cuda_layers.cpp:58:26: error: expected primary-expression before ‘<’ token
58 | linear_forward_cuda<<<threads,blocks>>>(input, weight_transposed, bias, output);

I am calling this kernel from another .cpp file, and I am adding both files to setup.py as shown in the guide I linked above.

This kernel works just fine when I call it via another .cu file and compile that file using just NVCC. My guess is Ninja is trying to compile this file using gcc instead of using nvcc? Since this is all handled by CPP extension, I’m not sure how to fix this. Any help would be much appreciated.

EDIT: I tried renaming both files to .cu so that NVCC compiles both, and it seems to work. But I’m not sure if that is the right way to fix this.

It is. The CUDA kernel launch syntax is not standard C++, hence it fails when you compile with a regular C++ compiler. If you don’t like that, you could put only CUDA code inside the .cu file, define wrappers inside that same .cu file which will simply launch the kernels, and then export those wrappers so that you could use them (and not the CUDA kernel launch syntax) from your other .cpp files.
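A minimal sketch of that wrapper pattern, in case it helps. The file name linear_kernel.cu and the wrapper name linear_forward_launcher are made up for illustration, and the kernel is a lightly adapted copy of the one posted above. The idea is that only this .cu file contains the <<<...>>> syntax, while the .cpp side sees an ordinary C++ function:

```cuda
// linear_kernel.cu (hypothetical file name): compiled by nvcc.
#include <cassert>
#include <cuda_runtime.h>

// One block per output element, one thread per input feature.
__global__ void linear_forward_kernel(const float* input, const float* weight,
                                      const float* bias, float* output) {
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    const int m = blockDim.x;

    __shared__ float shared_sum[1024];
    shared_sum[tid] = weight[bid * m + tid] * input[tid];
    __syncthreads();  // all products written before the reduction starts

    // Tree reduction; relies on blockDim.x being a power of two.
    for (unsigned int i = m / 2; i > 0; i >>= 1) {
        if (tid < i) {
            shared_sum[tid] += shared_sum[tid + i];
        }
        __syncthreads();
    }

    if (tid == 0) {
        output[bid] = shared_sum[0] + bias[bid];
    }
}

// Plain C++ entry point (hypothetical name). All pointers are device pointers.
// A .cpp file only needs this declaration:
//   void linear_forward_launcher(const float*, const float*, const float*,
//                                float*, int, int);
void linear_forward_launcher(const float* input, const float* weight,
                             const float* bias, float* output,
                             int out_features, int in_features) {
    // The kernel above assumes a power-of-two block size of at most 1024.
    assert(in_features > 0 && in_features <= 1024 &&
           (in_features & (in_features - 1)) == 0);
    // Launch configuration order is <<<number of blocks, threads per block>>>.
    linear_forward_kernel<<<out_features, in_features>>>(input, weight, bias, output);
}
```

From the .cpp file you would then declare linear_forward_launcher and call it with raw device pointers, for example ones obtained via data_ptr<float>() on contiguous CUDA tensors (an assumption about how your tensors are set up). Note that the launch configuration takes the grid size first and the block size second, so it is worth double-checking the order of the two arguments in the line from your error message.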

Thank you. It was just because the instructions I was following on the PyTorch website said to have the kernel invocation in the .cpp file instead of the .cu file. But having the entire thing in the .cu file seems to work just fine.