Using CUDA IPC memory handles in PyTorch

@colesbury Thanks so much for all the help on this. I think I’m almost there.

I’ve been playing around with extension-cpp and I’m running into a couple of issues.

As a reference point, I am mostly following the extension-cpp tutorial here:
https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-mixed-c-cuda-extension

So I have three files: a .py, a .cpp, and a .cu. I am using the JIT method for compiling my extension.

In the .cu file, I am using the CUDA runtime API to extract a float* device pointer from a cudaIpcMemHandle_t. I then use tensorFromBlob to wrap that pointer in an at::Tensor without copying. Here is how I am calling tensorFromBlob:

```cpp
at::Tensor cuda_tensor_from_shm = at::CUDA(at::kFloat).tensorFromBlob(d_img, {rows, cols});
```
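For context, here is a simplified sketch of how I get d_img in the .cu file (the helper name and the way the handle, rows, and cols reach this process are stand-ins, not my exact code):

```cpp
#include <stdexcept>
#include <cuda_runtime.h>
#include <ATen/ATen.h>

// Map the producer process's allocation into this process and wrap it.
at::Tensor open_ipc_tensor(cudaIpcMemHandle_t handle, int64_t rows, int64_t cols) {
  void* dev_ptr = nullptr;
  // Open the IPC handle to get a device pointer valid in this process.
  cudaError_t err = cudaIpcOpenMemHandle(&dev_ptr, handle,
                                         cudaIpcMemLazyEnablePeerAccess);
  if (err != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(err));
  }
  float* d_img = static_cast<float*>(dev_ptr);
  // tensorFromBlob does not copy and does not take ownership:
  // the mapping must stay open for as long as the tensor is used.
  return at::CUDA(at::kFloat).tensorFromBlob(d_img, {rows, cols});
}
```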

My first problem is that the above line of code takes about three seconds to execute. Does it take this long only the first time I call the extension, or will it be slow every time? The whole point of using shared memory and CUDA IPC handles was to make the cost of transferring data negligibly small; I was hoping for sub-millisecond times.
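One way I plan to check is to time consecutive calls (just a sketch, reusing d_img, rows, and cols from above):

```cpp
#include <chrono>
#include <iostream>
#include <ATen/ATen.h>

// Time the wrap on consecutive calls to separate any one-time setup cost
// (e.g. CUDA context creation) from the steady-state cost per call.
void time_wrap(float* d_img, int64_t rows, int64_t cols) {
  for (int i = 0; i < 3; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    at::Tensor t = at::CUDA(at::kFloat).tensorFromBlob(d_img, {rows, cols});
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "call " << i << ": "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
  }
}
```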

The second problem is that I get a segmentation fault somewhere between the .cpp code and the .py code; I haven't pinpointed it precisely yet. My guess is that after calling tensorFromBlob, I need to copy the data into a new at::Tensor before I can use it in PyTorch. Is that correct? If so, is there a super-fast ATen device-to-device copy I can use?
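Concretely, what I have in mind is something like this (just a sketch of what I would try; I don't know whether clone() is the right or fastest option here):

```cpp
// cuda_tensor_from_shm is the non-owning tensor from above.
// clone() allocates fresh device memory and does a device-to-device copy,
// giving a tensor whose storage PyTorch owns and can free safely.
at::Tensor owned = cuda_tensor_from_shm.clone();
```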