Using CUDA IPC memory handles in PyTorch

I want to insert a trained pytorch model into the middle of a multi-process pipeline. The input/output data for the model should never move off the GPU. Device pointers to the data need to be passed back and forth between processes using CUDA IPC memory handles.

Basically, I need a way to access/create the IPC handles and to convert to/from torch.cuda.*Tensor objects.

What is the best way to implement this? I know PyCUDA gives access to CUDA IPC handles (e.g. pycuda.driver.mem_get_ipc_handle), but in my experience PyCUDA does not play nicely with PyTorch. Are there any other simple solutions in the Python realm?

You can share CUDA tensors across processes using multiprocessing queues (e.g. multiprocessing.SimpleQueue). PyTorch will create an IPC handle when the tensor is added to the queue and open that handle when the tensor is retrieved from the queue.

Beware that you need to keep the original CUDA tensor alive for at least as long as any view of it is accessible in another process.
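
A minimal sketch of that pattern, assuming both ends are Python/PyTorch processes and using torch.multiprocessing (PyTorch's drop-in replacement for multiprocessing); the consumer function here is just a toy:

import torch
import torch.multiprocessing as mp

def consumer(queue):
    # get() opens the IPC handle; the tensor aliases the producer's GPU memory
    t = queue.get()
    print(t.sum().item())

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)  # required for CUDA tensors
    queue = mp.SimpleQueue()
    p = mp.Process(target=consumer, args=(queue,))
    p.start()

    x = torch.randn(4, 4, device='cuda')
    queue.put(x)  # sent as a CUDA IPC handle, no copy through host memory
    p.join()      # keep x alive until the consumer is done with it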

Thanks for the quick response @colesbury.

Just to clarify, the other processes in the pipeline are not Python processes (they are C/C++/CUDA). So it’s important that I can access/create IPC handles with device pointers to the raw underlying tensor data. My confusion is how to work with these handles within the Python/PyTorch process. Correct me if I’m wrong, but it seems that multiprocessing.SimpleQueue will only work the way you describe if both processes are using PyTorch.

So, just to be absolutely clear, the full plan is to use shared memory to pass IPC handles between processes. For example, the shared memory file will include a 64-byte cudaIpcMemHandle_t (referencing the raw tensor data in GPU memory), plus additional bytes to specify the number of rows and columns in the tensor.
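
For illustration, that record could be produced on the Python side with just struct and mmap. This is only a sketch; the layout, field widths, and file path are placeholders for whatever the C/C++ consumer actually expects:

import mmap
import os
import struct

# Illustrative layout: 64-byte cudaIpcMemHandle_t, then rows and cols as int64
RECORD_FMT = '<64sqq'
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 80 bytes

def write_record(path, handle_bytes, rows, cols):
    # The C/C++ side mmaps the same file and memcpys the first 64 bytes
    # into a cudaIpcMemHandle_t before calling cudaIpcOpenMemHandle().
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, RECORD_SIZE)
        with mmap.mmap(fd, RECORD_SIZE) as m:
            m[:] = struct.pack(RECORD_FMT, handle_bytes, rows, cols)
    finally:
        os.close(fd)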

It will be a bit tricky to do correctly because small PyTorch storages are packed into the same CUDA allocation block. You will have to rely on implementation details of PyTorch that may change in the future:

x = torch.randn(100, device='cuda')
storage = x.storage()
device, handle, size, offset, view_size = storage._share_cuda_()

device is the index of the GPU (i.e. 0 for the first GPU)
handle is the cudaIpcMemHandle_t as a Python byte string
size is the size of the whole allocation (not of the Storage), in elements, not bytes
offset is the offset in bytes of the storage data pointer from the CUDA allocation
view_size is the size of the storage (in elements, not bytes!)
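
Putting this together with the shared-memory layout sketched earlier, the producer side could look roughly like the following. Again, only a sketch: _share_cuda_() is an internal API, and the number and meaning of its return values have changed in later PyTorch releases, so the unpacking should be checked against the version actually in use.

import struct
import torch

x = torch.randn(100, 200, device='cuda')

# Internal API; this 5-tuple matches the description above for the PyTorch
# version current at the time of writing.
device, handle, size, offset, view_size = x.storage()._share_cuda_()

# Pack what the external C/C++/CUDA consumer needs: the 64-byte IPC handle,
# the byte offset into the allocation, and the tensor shape.
record = struct.pack('<64sqqq', handle, offset, x.size(0), x.size(1))

# The consumer memcpys the first 64 bytes into a cudaIpcMemHandle_t, calls
# cudaIpcOpenMemHandle(), and adds `offset` bytes to the returned base
# pointer to reach this tensor's data. Keep x alive for as long as the
# consumer uses that memory.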

Thanks again @colesbury.

So _share_cuda_() gives me access to the cudaIpcMemHandle_t of an existing torch.cuda.Tensor. It’s unfortunate that the handle is not exposed through a regular function call, but it’s a good start.

Now, what about when I need to convert the other way around, from handle to tensor? If I have a cudaIpcMemHandle_t, read in from shared memory and converted to a Python byte string, can I insert that into a torch.cuda.Storage and thereby produce a torch.cuda.Tensor that points to the appropriate data in GPU memory?

Also, can you explain the offset a bit more? It sounds like multiple different torch.cuda.Storage objects share the same cudaIpcMemHandle_t, but with different offsets in memory. Is that correct? I don’t see that as a major problem. I’ll just have to write the offset to shared memory as well.

Another idea altogether: what about using PyTorch’s extension-ffi to access the cudaIpcMemHandle_t and storing the data into a THCudaTensor? I’ve never played with extension-ffi before, so I don’t really understand its capabilities. I’ll need to make calls to functions like cudaIpcOpenMemHandle, which are part of CUDA’s runtime API. Is this possible?

If you want to go back and forth between C/C++ and Python, you probably want to use an extension. You should prefer https://github.com/pytorch/extension-cpp over extension-ffi, as TH/THC is slowly being deprecated and moved into ATen.

ATen provides a Type::storageFromBlob function which you can use after you open the IPC handle.

I don’t think there’s an equivalent function in Python. It would probably be good for us to add something like that.

@colesbury Thanks so much for all the help on this. I think I’m almost there.

I’ve been playing around with extension-cpp and I’m running into a couple of issues.

As a reference point, I am mostly following the extension-cpp tutorial here:
https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-mixed-c-cuda-extension

So I have three files, a .py, a .cpp, and a .cu. I am using the JIT method for compiling my extension.

In the .cu file, I am using the CUDA runtime API to extract a float* device pointer from a cudaIpcMemHandle_t. I am then using tensorFromBlob to wrap that pointer in an at::Tensor object. Here is how I am using tensorFromBlob:

at::Tensor cuda_tensor_from_shm = at::CUDA(at::kFloat).tensorFromBlob(d_img, {rows,cols});

My first problem is that the above line of code takes about three seconds to execute. Does it only take so long the first time I call the extension, or is it going to be slow every time? Obviously the whole point of using shared memory and CUDA IPC handles was to make the cost of transferring data negligibly small; I was hoping for sub-millisecond times.

The second problem is that I get a segmentation fault at some point between the .cpp code and the .py code. I haven’t precisely pinpointed it yet. However, my guess is that after calling tensorFromBlob, I need to copy the data to a new at::Tensor before I can use it in PyTorch. Is that correct? If so, is there a super-fast ATen device-to-device copy I can use?

Everything works after modifying my tensorFromBlob code from:

at::Tensor cuda_tensor_from_shm = at::CUDA(at::kFloat).tensorFromBlob(d_img, {rows,cols});

to:
at::Tensor cuda_tensor_from_shm = torch::CUDA(at::kFloat).tensorFromBlob(d_img, {rows,cols});

I’ll need to dig into the code to understand why torch::CUDA is the correct scoping, but in any case it works.