How to implement a work pool in a GPU cluster


What is the most efficient way to implement a work pool in a GPU cluster? I have tried a JoinableQueue, but it takes a long time to get a large item (e.g. a training batch) from the queue. Is there a better way to implement this? Is it possible to store the data on the GPU and share it between processes?
I also read this documentation for shared memory:
Is this “shared memory” in CPU or GPU memory? What is its structure?
Thank you.

cc @VitalyFedyunin for multiprocessing questions