CUDA 11.6 extension problem

Hello, I am trying to use my CUDA kernel from PyTorch in Python. My setup: PyTorch 1.10.2+cu113, Visual Studio 2019 with CUDA 11.6 installed, pybind11 2.8.1, Python 3.7.

The PyTorch documentation and forums indicate that in .cu files one should include the ATen library instead of the torch extension header (torch/extension.h) in order to avoid

"too few arguments for template template parameter "Tuple""

That workaround works. However, when I try to include pybind11 in my .cu file with

#include <pybind11/pybind11.h>

I get the error mentioned above.

If I comment the include out

//#include <pybind11/pybind11.h>

everything compiles and a .pyd file is produced in the solution folder. However, until I have it connected to a Python file, I do not know whether it works.

Detailed description of the configuration and the minimal example code I use:

  1. Install Python 3.7 with the debug libraries, and download libtorch-win-shared-with-deps-debug-1.10.2+cu113 from PyTorch.
  2. In vcpkg, install pybind11 (version 2.8.1; triplet set to 64-bit Windows):
.\vcpkg install pybind11
.\vcpkg integrate install
  3. In Visual Studio 2019, create a new CUDA 11.6 project.
  4. Set the architecture to x64.
  5. Edit the project properties (for all configurations unless stated otherwise):
    a) VC++ Directories - Include Directories
  • add the Python37/include folder
  • add …\libtorch-win-shared-with-deps-debug-1.10.2+cu113\libtorch\include
  • add …\libtorch-win-shared-with-deps-debug-1.10.2+cu113\libtorch\torch\csrc\api\include
    b) VC++ Directories - Library Directories
  • add the Python37/lib folder
    c) Linker - Additional Dependencies: python37_d.lib for Debug, python37.lib for Release
    d) C/C++ → Language → Conformance mode: No
    e) Linker - General - Additional Library Directories: …\libtorch-win-shared-with-deps-debug-1.10.2+cu113\libtorch\lib
    f) Linker - Input - Additional Dependencies: add torch.lib, torch_cuda.lib, caffe2_nvrtc.lib, c10.lib, c10_cuda.lib, torch_cpu.lib
    g) Build Events → Post-Build Event: xcopy $(ProjectDir)…\libtorch-win-shared-with-deps-debug-1.10.2+cu113\libtorch\lib\*.dll $(SolutionDir)$(Platform)\$(Configuration)\ /c /y
    h) General - Advanced - Target File Extension: changed to .pyd
    i) CUDA C/C++: relocatable device code enabled (-rdc=true)
    j) CUDA C/C++ - Device - Code Generation: compute_75,sm_75
    k) CUDA Linker - General - Additional Library Directories: added libtorch\lib

Currently I have a single file (I have not gotten to the Python file yet). It contains this simple code:

#include "cuda_runtime.h"
#include "cuda.h"
#include "device_launch_parameters.h"
#include "stdio.h"

#include <ATen/ATen.h>
#include <Python.h>
#include <ATen/Parallel.h>
#include <ATen/Dispatch.h>
#include <ATen/core/op_registration/op_registration.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/quantized/Quantizer.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/library.h>
#include <ATen/native/StridedRandomAccessor.h>
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

#include <pybind11/pybind11.h>

#pragma once
// Grid-stride kernel: writes input[i] + 1 into output[i] for every i < length.
template <typename scalar_t>
__global__ void lltm_cuda_forward_kernel(
    at::PackedTensorAccessor<scalar_t, 1, at::RestrictPtrTraits> input,
    at::PackedTensorAccessor<scalar_t, 1, at::RestrictPtrTraits> output,
    int length) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < length; i += blockDim.x * gridDim.x) {
        output[i] = input[i] + 1;
    }
}

// Host-side launcher: dispatches on the dtype and launches on the current CUDA stream.
static void testAdd(
    at::Tensor inputtt,
    at::Tensor output) {
    int blocks = 10;
    int threads = 512;

    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    int length = 5;

    switch (inputtt.scalar_type()) {
    case at::ScalarType::Float:
        lltm_cuda_forward_kernel<float><<<blocks, threads, 0, stream>>>(
            inputtt.packed_accessor64<float, 1, at::RestrictPtrTraits>(),
            output.packed_accessor64<float, 1, at::RestrictPtrTraits>(), length);
        break;
    default:
        break;  // other dtypes are not handled in this minimal example
    }
}

int main() {
    printf("  *** ");
    return 0;
}

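For readers less familiar with the loop inside the kernel above, here is a host-only sketch of the same grid-stride indexing. It is an illustration and not part of the original project: `simulate_grid_stride_add_one` and `blocks_for` are hypothetical names, and the two outer loops merely stand in for CUDA's `blockIdx.x` and `threadIdx.x` so the indexing can be checked without a GPU.

```cpp
#include <vector>

// Host-only stand-in for lltm_cuda_forward_kernel (hypothetical helper).
// Every simulated thread starts at its global index and then advances by
// the total number of threads in the grid, exactly like the kernel's loop.
std::vector<float> simulate_grid_stride_add_one(const std::vector<float>& input,
                                                int blocks, int threads) {
    const int length = static_cast<int>(input.size());
    std::vector<float> output(input.size(), 0.0f);
    for (int block = 0; block < blocks; ++block) {          // plays blockIdx.x
        for (int thread = 0; thread < threads; ++thread) {  // plays threadIdx.x
            for (int i = block * threads + thread; i < length;
                 i += threads * blocks) {
                output[i] = input[i] + 1.0f;
            }
        }
    }
    return output;
}

// Smallest block count that gives one thread per element (ceiling division).
int blocks_for(int length, int threads) {
    return (length + threads - 1) / threads;
}
```

With `length = 5` and a 10 x 512 launch, only the first five simulated threads do any work, and `blocks_for(5, 512)` would already be 1. Because of the stride, the loop also covers inputs longer than the grid.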
What am I doing wrong? Is there a better way? For some obscure reason I achieve greater occupancy with my kernel when building it as a Visual Studio project than with CMake, but I can also compile it with CMake if need be.

I also tried creating a separate .cpp file and invoking the kernel from there by including the .cu file, but in that setup the cooperative groups headers produced a lot of compile errors.
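A commonly used alternative to including the .cu file is to share only a plain declaration between the two translation units, so that nvcc never sees pybind11 and the host compiler never sees the cooperative groups headers. The sketch below is an assumption-laden illustration, not a confirmed fix for this thread: the file names, `add_one`, and the module name `my_ext` are hypothetical, the interface uses `std::vector<float>` instead of `at::Tensor` to stay self-contained, and a CPU loop stands in for the kernel launch.

```cpp
#include <cstddef>
#include <vector>

// --- add_one.h (hypothetical): the only declaration both files share. ---
// It mentions neither CUDA nor pybind11 types.
std::vector<float> add_one(const std::vector<float>& input);

// --- add_one.cu (hypothetical): the file compiled by nvcc. ---
// In a real project this body would copy the data to the device and launch
// the kernel; a CPU loop stands in here so the sketch runs without a GPU.
std::vector<float> add_one(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = input[i] + 1.0f;
    }
    return output;
}

// --- bindings.cpp (hypothetical): compiled by the host compiler only. ---
// Shown as comments because it needs the pybind11 and Python headers:
//   #include <pybind11/pybind11.h>
//   #include <pybind11/stl.h>   // converts Python lists to std::vector
//   #include "add_one.h"
//   PYBIND11_MODULE(my_ext, m) { m.def("add_one", &add_one); }
```

Whether this layout avoids the "Tuple" error on CUDA 11.6 in this exact configuration is not established here; it only shows how the binding code can be kept out of the file that nvcc compiles.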


I did not figure out how to make it work with CUDA 11.6, so I downgraded to CUDA 11.3 and everything works.
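Since the resolution was a toolkit downgrade, a small check can make the toolkit version visible at build or startup time. This is a sketch under stated assumptions: it relies on CUDA encoding `CUDART_VERSION` as major * 1000 + minor * 10 (11060 for 11.6, 11030 for 11.3), the helper names are hypothetical, and requiring exactly 11.3 simply encodes what worked in this thread for a cu113 build of PyTorch.

```cpp
// Decoded toolkit version (hypothetical helper type).
struct CudaVersion {
    int major_version;
    int minor_version;
};

// Decode an integer in CUDART_VERSION format, e.g. 11060 -> 11.6.
CudaVersion decode_cuda_version(int encoded) {
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// True when the toolkit is 11.3, the version a "+cu113" PyTorch build targets.
bool matches_cu113(int encoded) {
    const CudaVersion v = decode_cuda_version(encoded);
    return v.major_version == 11 && v.minor_version == 3;
}
```

In a .cu file one could call `matches_cu113(CUDART_VERSION)` after including the CUDA runtime header and print a warning when it returns false.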


I had the same issue and can confirm that downgrading to CUDA 11.3 solved it!