It will be a bit tricky to do correctly because small PyTorch storages are packed into the same CUDA allocation block. You will have to rely on implementation details of PyTorch that may change in the future:
x = torch.randn(100, device='cuda')
storage = x.storage()
device, handle, size, offset, view_size = storage._share_cuda_()
device
is the index of the GPU (i.e. 0 for the first GPU)
handle
is the cudaIpcMemHandle_t
as a Python byte string
size
is the size of the allocation (not the Storage!, in elements, not bytes!)
offset
is the offset in bytes of the storage data pointer from the CUDA allocation
view_size
is the size of the storage (in elements, not bytes!)