Best practice to share CUDA tensors across multiple processes

Hi, I’m trying to build a multiprocess dataloader on my local machine for my RL implementation (ACER).
The reason I don’t use the PyTorch DataLoader and DataParallel is that I need to get the data from a remote server, so there are two types of processes in my system.

Learner: the main process. It gets data from the receivers and feeds it to DataParallel for training; its role is the consumer.

Receiver: it receives data over a network connection, loads it into GPU memory, and then shares a reference to that data with the Learner. There should be more than one receiver.

So I want to write code like the following:

import multiprocessing
from multiprocessing import Process

# Receiver
def receiver(data_q):
    # ... set up the socket, etc.
    while True:
        data = sock.recv()        # placeholder: receive a tensor from the remote server
        data = data.to('cuda')    # move it into GPU memory
        data_q.put(data)          # hand a reference over to the learner

# Learner
def learner(data_q):
    # ... build the model, etc.
    while True:
        data = data_q.get()
        model.train_with(data)    # placeholder for the actual training step

if __name__ == '__main__':
    data_q = multiprocessing.Queue(maxsize=10)
    procs = [
        Process(target=receiver, args=(data_q,)) for _ in range(10)
    ]
    procs.append(Process(target=learner, args=(data_q,)))
    for p in procs: p.start()
    for p in procs: p.join()

But I have some questions:

  1. Can I use the above code with a plain Python multiprocessing queue?

  2. How can I share the CUDA tensors? It seems the share_memory_ function is a no-op for them. Is it enough to just put the CUDA tensors into a PyTorch multiprocessing queue (see the first sketch after this list)?

  3. In https://pytorch.org/docs/stable/notes/multiprocessing.html, it says I should keep the CUDA tensors from going out of scope. Does that mean I should keep a reference to each CUDA tensor in the receiver (producer) process until the learner (consumer) process has finished with it? Or is it enough to keep the receiver (producer) process alive even though it no longer holds a reference to the CUDA tensor?

  4. The same page says it is better to reuse buffers passed through the queue. Does that mean I should share a tensor between the receiver (producer) and the learner (consumer) once, then never release it and just refill it with new data using in-place operations? If so, what is the best practice for letting the receiver know that processing has completed in the learner, so the receiver can fill the (already shared) tensor with new data? Should I make a flag, lock (or event, queue, etc.) for each receiver process? See the second sketch after this list.

  5. Or is it better to share the model with all the receivers, so each one can call the train function within its own process (with a lock, to make sure only one process trains at a time)? See the third sketch below.
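
To make #2 concrete, here is a minimal sketch of what I have in mind. I'm assuming torch.multiprocessing with the spawn start method is what's needed for CUDA tensors; the tensor shape, the batch count, and the done event are all made up:

import torch
import torch.multiprocessing as mp

def receiver(data_q, done):
    for _ in range(5):
        data = torch.randn(4, 84, 84, device='cuda')   # stand-in for data received over the socket
        data_q.put(data)                                # the CUDA tensor is shared via IPC, not copied
    done.wait()   # keep the producer alive until the consumer is finished (my reading of #3)

def learner(data_q, done):
    for _ in range(5):
        data = data_q.get()
        print(data.shape, data.device)                  # stand-in for the training step
    done.set()

if __name__ == '__main__':
    mp.set_start_method('spawn')                        # required when subprocesses use CUDA
    data_q, done = mp.Queue(maxsize=10), mp.Event()
    procs = [mp.Process(target=receiver, args=(data_q, done)),
             mp.Process(target=learner, args=(data_q, done))]
    for p in procs: p.start()
    for p in procs: p.join()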
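
And for #4, this is the kind of buffer reuse I mean: each receiver allocates its CUDA buffers once, keeps references to them for its whole lifetime, and only refills a buffer after the learner has handed its index back through a "free" queue. Everything here (shapes, buffer count, batch count, the synchronize call) is my guess at the pattern, not something I've verified:

import torch
import torch.multiprocessing as mp

N_BUFFERS, N_BATCHES = 2, 10

def receiver(ready_q, free_q):
    # Allocate the shared buffers once; the receiver keeps these references alive.
    buffers = [torch.empty(4, 84, 84, device='cuda') for _ in range(N_BUFFERS)]
    for idx in range(N_BUFFERS):
        free_q.put(idx)                        # every buffer starts out free
    for _ in range(N_BATCHES):
        idx = free_q.get()                     # wait for a buffer the learner is done with
        buffers[idx].copy_(torch.randn(4, 84, 84, device='cuda'))  # in-place refill (stand-in for socket data)
        torch.cuda.synchronize()               # make sure the write is finished before handing it over
        ready_q.put((idx, buffers[idx]))       # same storage every time, no new allocation
    ready_q.put(None)                          # sentinel: no more data
    for _ in range(N_BUFFERS):
        free_q.get()                           # stay alive until the learner has released every buffer

def learner(ready_q, free_q):
    while True:
        item = ready_q.get()
        if item is None:
            break
        idx, buf = item
        print(idx, buf.mean().item())          # stand-in for the training step
        free_q.put(idx)                        # only the index goes back; the buffer is reused in place

if __name__ == '__main__':
    mp.set_start_method('spawn')
    ready_q, free_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=receiver, args=(ready_q, free_q)),
             mp.Process(target=learner, args=(ready_q, free_q))]
    for p in procs: p.start()
    for p in procs: p.join()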
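
Finally, for #5, a Hogwild-style sketch of what I mean by sharing the model with the receivers and serializing the training step with a lock. The toy linear model, shapes, learning rate, and step count are all made up, and I'm not sure this design is actually a good idea:

import torch
import torch.nn as nn
import torch.multiprocessing as mp

def receiver_and_trainer(model, lock, n_steps=5):
    # Each receiver builds its own optimizer, but the parameters it updates are the
    # shared CUDA parameters, so in-place updates are visible to every process.
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(n_steps):
        data = torch.randn(32, 128, device='cuda')      # stand-in for data from the socket
        target = torch.randn(32, 1, device='cuda')
        with lock:                                       # only one process trains at a time
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(data), target)
            loss.backward()
            opt.step()

if __name__ == '__main__':
    mp.set_start_method('spawn')
    model = nn.Linear(128, 1).to('cuda')                 # parameters shared with the children via CUDA IPC
    lock = mp.Lock()
    procs = [mp.Process(target=receiver_and_trainer, args=(model, lock))
             for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()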

Did you figure out the answers to these questions? I'm trying to understand the same issues, especially #2: how do you share CUDA tensors in multiprocessing?