Sharing a CUDA tensor between different processes and PyTorch versions

I am trying to share a CUDA tensor from one Python script to another, where the two scripts run different torch versions. My thinking is that if we can share the memory address and the tensor size from one process to another, it should be straightforward to rebuild that tensor in any other process, since a plain CUDA tensor object only holds metadata about where the data lives and how large it is, while the actual tensor data stays in GPU memory.

So I’ve found some helpful (private) APIs in PyTorch to do this. Here’s what I have in my first process, a Docker container to be specific:

# This is process one - docker container 1
import torch

def _extract_cuda_metadata(tensor: torch.Tensor):
    storage = tensor._typed_storage()
    (
        storage_device,
        storage_handle,
        storage_size_bytes,
        storage_offset_bytes,
        ref_counter_handle,
        ref_counter_offset,
        event_handle,
        event_sync_required,
    ) = storage._share_cuda_()
    return {
        "dtype": tensor.dtype,
        "tensor_size": tensor.size(),
        "tensor_stride": tensor.stride(),
        "tensor_offset": tensor.storage_offset(),
        "storage_cls": type(storage),
        "storage_device": storage_device,
        "storage_handle": storage_handle,
        "storage_size_bytes": storage_size_bytes,
        "storage_offset_bytes": storage_offset_bytes,
        "requires_grad": tensor.requires_grad,
        "ref_counter_handle": ref_counter_handle,
        "ref_counter_offset": ref_counter_offset,
        "event_handle": event_handle,
        "event_sync_required": event_sync_required,
    }

This is the function I use to extract all the metadata I mentioned. And below is the code that rebuilds the CUDA tensor in the second process.

# This is process 2 - docker container 2
from torch.multiprocessing.reductions import rebuild_cuda_tensor

# `info` is the dict received from process 1, i.e. the output of
# `_extract_cuda_metadata`
tensor = rebuild_cuda_tensor(torch.Tensor, **info)
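For completeness, the metadata dict has to cross the container boundary somehow (a socket, a file, a queue). Here is a stdlib-only sketch of how I serialize it; converting `torch.dtype`, `torch.Size`, and the storage class to plain Python types before pickling is my own assumption about what stays portable across torch versions:

```python
import pickle

def pack_cuda_metadata(meta):
    # Convert torch-specific objects (torch.dtype, torch.Size, the storage
    # class) to plain Python types, so unpickling on the receiver does not
    # depend on the sender's torch internals.
    portable = dict(meta)
    portable["dtype"] = str(meta["dtype"])            # e.g. "torch.float32"
    portable["tensor_size"] = tuple(meta["tensor_size"])
    portable["tensor_stride"] = tuple(meta["tensor_stride"])
    portable["storage_cls"] = getattr(
        meta["storage_cls"], "__name__", str(meta["storage_cls"])
    )
    return pickle.dumps(portable)

def unpack_cuda_metadata(payload):
    import torch  # resolved against the *receiver's* torch version
    meta = pickle.loads(payload)
    # "torch.float32" -> torch.float32
    meta["dtype"] = getattr(torch, meta["dtype"].split(".")[-1])
    meta["tensor_size"] = torch.Size(meta["tensor_size"])
    # Falling back to UntypedStorage is my assumption; the sender side
    # normally reports TypedStorage.
    meta["storage_cls"] = getattr(torch, meta["storage_cls"], torch.UntypedStorage)
    return meta
```

The opaque fields (`storage_handle`, `ref_counter_handle`, `event_handle`) are left as raw bytes, since they are just blobs as far as pickling is concerned.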

Now, this logic works exactly as I expected. It even works across different torch versions!

But when the two processes use torch versions that are too far apart, such as torch 2.5 (the latest at the time of writing) and torch 2.1, the same code fails with the following error:

RuntimeError: incorrect handle size
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
  1. Does that mean different torch versions use different handle sizes for the same CUDA tensor? How does this mechanism work, and how can I make the logic generic enough to be usable across different torch versions, if that is possible at all? I understand it might not be, but even extending my current solution to as many torch versions as possible would be enough; I don’t need to cover every version there is.

  2. What alternative, if any, would you suggest instead of my current implementation for the same goal, that is, sharing any sort of CUDA tensor with another process on the same machine?
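On question 1, my current working theory (an assumption I have not verified against torch internals) is that the classic `cudaIpcMemHandle_t` is a fixed 64-byte opaque blob, while newer torch builds can allocate with "expandable segments" and emit a longer, differently formatted handle that an older receiver rejects as "incorrect handle size". If that theory is right, pinning the allocator on the sender and sanity-checking the handle before sending might widen compatibility:

```python
import os

# Pin the allocator behaviour *before* torch is imported in the sender
# process. Assumption to test: with expandable segments disabled, newer
# torch emits the classic fixed-size IPC handle that old receivers expect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"

CUDA_IPC_HANDLE_SIZE = 64  # sizeof(cudaIpcMemHandle_t) in the CUDA runtime

def looks_like_legacy_handle(storage_handle) -> bool:
    # A handle an old receiver can parse should be exactly 64 opaque bytes;
    # anything else is presumably a newer multi-part handle format.
    return (
        isinstance(storage_handle, bytes)
        and len(storage_handle) == CUDA_IPC_HANDLE_SIZE
    )
```

Checking `looks_like_legacy_handle(info["storage_handle"])` on the sender side would at least let me fail fast instead of crashing the receiver.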

[EDIT]
I am aware that I can copy the tensor from GPU to CPU memory, encode it, and share it, but that is inefficient: it takes too much time and is not practical for my use case.
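One alternative I am considering for question 2 (a sketch, untested): bypass torch’s private handle layout entirely and use the raw CUDA runtime IPC API through ctypes, so neither side depends on a torch version. `cudaIpcGetMemHandle` / `cudaIpcOpenMemHandle` are the real CUDA runtime functions; wrapping the opened pointer back into a tensor (e.g. via CuPy’s `UnownedMemory` plus DLPack) is the part I would still need to work out:

```python
import ctypes

CUDA_IPC_HANDLE_SIZE = 64  # sizeof(cudaIpcMemHandle_t)
cudaIpcMemLazyEnablePeerAccess = 1

def _cudart():
    # Loaded lazily so this module can be imported on machines without CUDA.
    return ctypes.CDLL("libcudart.so")

def export_ipc_handle(device_ptr: int) -> bytes:
    """Sender: turn a device pointer into a 64-byte cudaIpcMemHandle_t blob.
    Caveat: the pointer must be the base of a real cudaMalloc allocation;
    tensor.data_ptr() usually points inside a caching-allocator block, so
    the tensor may need to be copied into a dedicated allocation first."""
    handle = (ctypes.c_byte * CUDA_IPC_HANDLE_SIZE)()
    err = _cudart().cudaIpcGetMemHandle(
        ctypes.byref(handle), ctypes.c_void_p(device_ptr)
    )
    if err != 0:
        raise RuntimeError(f"cudaIpcGetMemHandle failed with error {err}")
    return bytes(handle)

def open_ipc_handle(handle_bytes: bytes) -> int:
    """Receiver: map the shared allocation into this process and return the
    device pointer, independent of the sender's torch version."""
    handle = (ctypes.c_byte * CUDA_IPC_HANDLE_SIZE).from_buffer_copy(handle_bytes)
    ptr = ctypes.c_void_p()
    err = _cudart().cudaIpcOpenMemHandle(
        ctypes.byref(ptr), handle, cudaIpcMemLazyEnablePeerAccess
    )
    if err != 0:
        raise RuntimeError(f"cudaIpcOpenMemHandle failed with error {err}")
    return ptr.value
```

The shape, stride, and dtype would still travel alongside the 64-byte handle, but as plain Python values rather than torch-internal objects.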

[EDIT] I ran another test: with the code above I can share a CUDA tensor from 2.1 to 2.5, but not from 2.5 to 2.1.

Thanks a lot for your time.