I am following the PyTorch C FFI examples to build a C extension with CUDA.
When I take the data out of a CudaTensor as follows, I get a segmentation fault. Is there any way to access the CUDA memory like this, and to run a for/while loop in parallel in a C extension with CUDA?
I’m sorry to bother you again.
I want to know how to bind a CUDA program to PyTorch when I have my kernels in my_lib.cu.
The FFI examples don’t seem to support .cu files or nvcc?
You can take the data pointers out of the arguments of your FFI function (using THCudaTensor_data(state, tensor)), inspect its size (assuming it’s contiguous - THCudaTensor_numel(state, tensor)), and pass both of these into your custom kernel. Does that help?
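For example, here is a minimal sketch of such a wrapper and kernel. The names my_lib_scale, scale_launcher, and scale_kernel are hypothetical; only THCudaTensor_data and THCudaTensor_numel come from the description above:

```c
/* src/my_lib.c - hypothetical FFI wrapper. torch.utils.ffi provides
   the THCState pointer when building with with_cuda=True. */
#include <THC/THC.h>

extern THCState *state;

/* launcher defined in the separately compiled .cu file below */
void scale_launcher(float *data, ptrdiff_t n, float alpha, cudaStream_t stream);

int my_lib_scale(THCudaTensor *input, float alpha)
{
    /* raw device pointer - only safe to index linearly if the tensor is contiguous */
    float *data = THCudaTensor_data(state, input);
    ptrdiff_t n = THCudaTensor_numel(state, input);
    scale_launcher(data, n, alpha, THCState_getCurrentStream(state));
    return 1;
}
```

```c
/* src/my_lib_kernel.cu - one thread per element, so the element-wise
   loop runs in parallel on the GPU instead of as a serial for loop. */
#include <stddef.h>

__global__ void scale_kernel(float *data, ptrdiff_t n, float alpha)
{
    ptrdiff_t i = blockIdx.x * (ptrdiff_t)blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

/* extern "C" so the C wrapper above can link against it */
extern "C" void scale_launcher(float *data, ptrdiff_t n, float alpha, cudaStream_t stream)
{
    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    scale_kernel<<<blocks, threads, 0, stream>>>(data, n, alpha);
}
```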
No, you’re not. It’s a problem with our CUDA extensions - they never use nvcc to compile your files. A workaround for now is to compile your CUDA kernels manually, link them into a shared library, and add that shared library to the libraries argument of create_extension. Sorry about that!
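For reference, the workaround might look roughly like this; the file and library names (my_lib_kernel, src/my_lib.c) are illustrative:

```sh
# compile the kernels manually into a shared library
nvcc -Xcompiler -fPIC --shared -o libmy_lib_kernel.so src/my_lib_kernel.cu
```

```python
# build.py - a sketch; extra keyword arguments to create_extension
# are forwarded to distutils, which is how the library gets linked in.
import os
from torch.utils.ffi import create_extension

this_dir = os.path.dirname(os.path.abspath(__file__))

ffi = create_extension(
    '_ext.my_lib',
    headers=['src/my_lib.h'],
    sources=['src/my_lib.c'],
    with_cuda=True,
    libraries=['my_lib_kernel'],   # links -lmy_lib_kernel
    library_dirs=[this_dir],       # where libmy_lib_kernel.so lives
)

if __name__ == '__main__':
    ffi.build()
```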
Thank you for your comments! With your help, I hacked my way through. I did it using slightly different arguments to ffibuilder.set_source than described above.
I first compile all the CUDA kernels with CMake:
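Roughly like this (the project layout and names are simplified for illustration):

```cmake
# CMakeLists.txt - builds the CUDA kernels into a shared library
cmake_minimum_required(VERSION 3.0)
project(my_lib_kernel)
find_package(CUDA REQUIRED)
cuda_add_library(my_lib_kernel SHARED src/my_lib_kernel.cu)
```

and then tell cffi to link against the result by passing extra keyword arguments to ffibuilder.set_source (they are forwarded to distutils):

```python
# build.py - a sketch; paths assume the CMake output lives in build/
import cffi

ffibuilder = cffi.FFI()
ffibuilder.cdef('''
typedef struct THCudaTensor THCudaTensor;
int my_lib_scale(THCudaTensor *input, float alpha);
''')
ffibuilder.set_source(
    '_ext.my_lib',
    open('src/my_lib.c').read(),
    include_dirs=['src'],
    libraries=['my_lib_kernel'],   # the CMake-built shared library
    library_dirs=['build'],
)

if __name__ == '__main__':
    ffibuilder.compile(verbose=True)
```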
Hi, your code is really helpful (including several of your other repositories). I am actually following it to write my own operator, but I am a little confused about contiguity, because I cannot tell which operations make a result non-contiguous. So, should we always check whether a tensor is contiguous before we get the data pointer with THCudaTensor_data?
@Jiang_He I have the same concern. If an input tensor is not contiguous, THCudaTensor_data may hand your kernel the wrong data. In the new C++ extension tutorial, the check is always performed:
```cpp
#define CHECK_CONTIGUOUS(x) AT_ASSERT(x.is_contiguous(), #x " must be contiguous")
```
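For intuition, here is a quick way to see that common operations can return non-contiguous views (plain PyTorch, nothing extension-specific):

```python
import torch

x = torch.randn(3, 4)
print(x.is_contiguous())   # True

y = x.t()                  # transpose returns a view with swapped strides
print(y.is_contiguous())   # False - walking the raw data pointer linearly
                           # would visit elements in the wrong order

z = y.contiguous()         # makes a compact row-major copy
print(z.is_contiguous())   # True - safe to hand to a raw-pointer kernel
```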