Multiprocessing with shared CUDA stream

Hello, I am trying to spawn multiple processes that each have their own CUDA stream and can synchronize with a main process.
The goal is for each stream to receive camera images, move them to the GPU, do some preprocessing, and then provide the data to the main process for a machine-learning application.
I found an example that works with multiple streams and multiprocessing here:

That example just creates a new stream in every process, since the spawn start method reinitializes the Stream in each child process.

What I need is a shared stream that I can synchronize on, so I know when it is safe to access the data after the calculations have finished, to avoid race conditions.

I tried to use Stream.cuda_stream to pass the raw stream pointer into a torch.cuda.ExternalStream in the child process, as shown in the example below, but I get the error:
RuntimeError: CUDA error: invalid resource handle

Here is my toy example:

import os
import time

import torch
from torch.multiprocessing import Process, set_start_method

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  # synchronous launches, so errors surface where they occur
set_start_method("spawn", True)  # CUDA requires the spawn start method

def process(stream_ptr, data):
    # Try to rebuild the parent's stream from its raw pointer -- this is what fails
    stream = torch.cuda.ExternalStream(stream_ptr, "cuda:0")
    with torch.cuda.stream(stream):
        data[:] = data * 2  # RuntimeError: CUDA error: invalid resource handle

if __name__ == "__main__":
    stream = torch.cuda.Stream(device='cuda:0')
    with torch.cuda.stream(stream):
        data = torch.ones((2, 2), device='cuda:0')
        data.share_memory_()  # share the CUDA tensor with the child process

    p1 = Process(target=process, args=(stream.cuda_stream, data))
    p1.start()
    timeout_start = time.perf_counter()
    while time.perf_counter() - timeout_start < 10:
        print(data)  # poll the tensor to see whether the child's write has landed
        time.sleep(1)
    p1.join()
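In case it helps frame the question: if the raw stream pointer simply cannot be shared across processes, one alternative I have been looking at is interprocess CUDA events. This is only an untested sketch of the idea, assuming torch.cuda.Event(interprocess=True) together with ipc_handle() / Event.from_ipc_handle() is the intended mechanism for cross-process synchronization; I am not sure this is the right approach either:

```python
import torch
from torch.multiprocessing import Process, set_start_method


def worker(event_handle, data):
    # Rebuild the interprocess event from its IPC handle inside the child
    event = torch.cuda.Event.from_ipc_handle("cuda:0", event_handle)
    data[:] = data * 2  # runs on the child's own stream
    event.record()      # mark the point the parent can wait for


if __name__ == "__main__":
    set_start_method("spawn", force=True)
    if torch.cuda.is_available():
        # interprocess=True should allow sharing the event via an IPC handle
        event = torch.cuda.Event(interprocess=True)
        data = torch.ones((2, 2), device="cuda:0")
        data.share_memory_()  # share the CUDA tensor with the child

        p = Process(target=worker, args=(event.ipc_handle(), data))
        p.start()
        p.join()             # make sure the child has recorded the event
        event.synchronize()  # block until the child's GPU work is done
        print(data)
    else:
        print("CUDA not available; skipping the sketch")
```

Would something like this be the recommended way to know when the child's computation has finished, instead of sharing the stream itself?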