Can't compile simple example CUDA kernel

Good afternoon all,

I have been looking at converting some C++ code that I’ve had good success with into a CUDA kernel, to maybe squeeze even more speed out of it. I decided to try my hand at a simple kernel first, a slight tweak on the example found here

However, when I try to compile this, nvcc just hangs forever:

I don’t think it’s an issue with my code itself, but just in case, here it is:

#include <torch/script.h>
#include <cuda_runtime.h>

using namespace torch;

// Grid-stride loop: each thread starts at its global index and
// advances by the total number of launched threads.
__global__ void add_kernel(int64_t N, const float* x, const float* y, float* z) {
    int64_t start = blockIdx.x * blockDim.x + threadIdx.x;
    int64_t stride = gridDim.x * blockDim.x;

    for (int64_t i = start; i < N; i += stride) {
        z[i] = x[i] + y[i];
    }
}

Tensor add_gpu(Tensor a, Tensor b) {

    int64_t N = a.size(0);
    size_t bytes = N * sizeof(float);

    Tensor result = torch::empty(N, torch::kFloat);

    float* d_a;
    float* d_b;
    float* d_z;
    float* h_z = result.data_ptr<float>();
    cudaMallocManaged(&d_a, bytes);
    cudaMallocManaged(&d_b, bytes);
    cudaMallocManaged(&d_z, bytes);

    cudaMemcpy(d_a, a.data_ptr<float>(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data_ptr<float>(), bytes, cudaMemcpyHostToDevice);

    int64_t block_size = 256;
    // Integer ceiling division; ceil(N / block_size) would truncate first.
    int64_t grid_size = (N + block_size - 1) / block_size;

    add_kernel<<<grid_size, block_size>>>(N, d_a, d_b, d_z);
    cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_z);

    return result;
}

int main(void) {
    int64_t N = 100000;
    // arange defaults to int64, so request float to match data_ptr<float>().
    torch::Tensor x = torch::arange(N, torch::kFloat);
    torch::Tensor y = x + 1;
    Tensor z = add_gpu(x, y);
}

Can anyone explain how to correctly compile and run this little test?

Is nvcc really hanging or is it expecting more input?
I’m not familiar with Windows/MINGW64, but on Linux, bash will continue the command on a new line if a trailing \ is entered, as seen here:

$ ls
build  CMakeLists.txt  main.cpp

$ ls \

Note that my cursor drops to the new line and I can continue with my command, e.g.:

$ ls \
> ./build
CMakeCache.txt  CMakeFiles  cmake_install.cmake

It looks like it was expecting a filename, not a directory. Once provided with the full path to script.h, it worked… kind of.

Now it can’t find the includes listed at the top of script.h, and providing a second file location as a command-line argument does not seem to remedy things.
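For what it’s worth, the usual way past missing includes is to pass the libtorch include and library directories to nvcc with -I/-L rather than handing it individual header paths. A sketch, assuming libtorch is unpacked at /path/to/libtorch (a hypothetical path; substitute your actual install location, and the exact library list can vary by libtorch build):

# /path/to/libtorch is a placeholder, not a real path.
nvcc main.cu -o main \
  -I/path/to/libtorch/include \
  -I/path/to/libtorch/include/torch/csrc/api/include \
  -L/path/to/libtorch/lib \
  -ltorch -ltorch_cpu -lc10

You’d also need /path/to/libtorch/lib on LD_LIBRARY_PATH (or the rpath) at run time.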

Rather than trying to compile with nvcc directly, I decided to try using torch::RegisterOperators and torch.cpp_extension to take care of compiling the CUDA file in the background. That worked, and since I’m specifically looking to load custom PyTorch operators, it was good enough. I’d still like to learn more about general nvcc compilation some day.