How to implement a work pool on a GPU cluster

Hi,

What is the most efficient way to implement a work pool on a GPU cluster? I have tried JoinableQueue, but it takes a long time to get a large item (e.g. a training batch) from the queue. Is there a better way to implement this? Is it possible to store data on the GPU and share it between different processes?
I also read the documentation on shared memory: https://pytorch.org/docs/stable/notes/multiprocessing.html
Is this “shared memory” on the CPU or the GPU? How is it structured?
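
To make the setup concrete, here is a minimal sketch of the JoinableQueue pattern mentioned above. It is only an illustration under assumptions, not the actual training code: `worker` and `run_pool` are hypothetical names, the batches are plain Python lists standing in for real training batches, and `sum` stands in for a training step. It also assumes a POSIX system so that the "fork" start method is available.

```python
import multiprocessing as mp


def worker(tasks, results):
    """Pull (batch_id, batch) items until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            tasks.task_done()
            break
        batch_id, batch = item
        # Stand-in for a real training step on the batch.
        results.put((batch_id, sum(batch)))
        tasks.task_done()


def run_pool(batches, num_workers=2):
    """Distribute batches over worker processes; return {batch_id: result}."""
    # Assumption: "fork" keeps this sketch short. Code that touches CUDA
    # in the workers generally needs "spawn" or "forkserver" instead.
    ctx = mp.get_context("fork")
    tasks = ctx.JoinableQueue()
    results = ctx.Queue()

    procs = [ctx.Process(target=worker, args=(tasks, results))
             for _ in range(num_workers)]
    for p in procs:
        p.start()

    for batch_id, batch in enumerate(batches):
        # Each put pickles the whole batch and sends it through a pipe.
        tasks.put((batch_id, batch))
    for _ in procs:
        tasks.put(None)  # one sentinel per worker

    tasks.join()  # wait until every item has been marked done
    collected = dict(results.get() for _ in batches)
    for p in procs:
        p.join()
    return collected
```

Calling `run_pool([[1, 2, 3], [4, 5], [6]])` hands each list to one of the workers. Every item is pickled in full when it is put on the queue and unpickled when it is taken off, which is presumably where the time goes for large batches.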
Thank you.

cc @VitalyFedyunin for multiprocessing questions