Is there a way to share a device-allocated tensor with multiple processes / threads such that all of them read from the same memory region instead of each holding its own copy of the tensor? I know that the CUDA IPC API lets you share an array allocated with cudaMalloc across multiple processes, so this should be possible in PyTorch.
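For comparison, this is the behavior I'm after, which already works for CPU tensors via `torch.multiprocessing` (a minimal sketch; `share_memory_()` moves the storage into shared memory so the child maps the same buffer rather than copying it):

```python
import torch
import torch.multiprocessing as mp

def reader(t, q):
    # The child maps the same shared-memory storage; no copy is made.
    q.put(t[0].item())

t = torch.zeros(4)
t.share_memory_()          # move the CPU storage into shared memory
t[0] = 42.0                # write before forking the reader

q = mp.Queue()
p = mp.Process(target=reader, args=(t, q))
p.start()
value = q.get()            # the child reads 42.0 from the shared storage
p.join()
```

I'd like the equivalent for a tensor living on the GPU, where the underlying device buffer is shared via CUDA IPC handles instead of being copied per process.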