Examples for asynchronous RL (IMPALA, Ape-X) with actors sending observations (not gradients) to a learner's replay buffer

Some of the most successful recent distributed/parallel approaches in RL, like IMPALA and Ape-X, start a pool of actors that collect experience/observations and send these to a centralized learner (or to distributed learners). Optimized TensorFlow implementations are available (IMPALA from DeepMind and Ape-X from Uber research; I can provide links if needed, but that’s TF, so maybe not relevant enough).

However, I can’t seem to find a set of examples that show how to implement this reasonably efficiently in PyTorch. Specifically, one common setting is a single machine with many CPU cores (e.g. 32-64) and a few GPUs (1-8). For this case, it seems that Python’s multiprocessing library and torch.multiprocessing could be useful, but I can’t find in-depth documentation or examples. The Python docs provide minimal information, and the PyTorch docs mention ‘best practices’ without detailing how to implement them. For example, “buffer reuse” is suggested for torch.multiprocessing.Queue, but it is not clear exactly how to do this (Torch.multiprocessing: how to "reuse buffers passed through a Queue"?). Moreover, most inter-process communication involves pickling, and it is not clear whether that is efficient. Is there any way to avoid pickling, or is it OK as-is? An alternative is something like shared memory arrays in Python (multiprocessing.Array), but most documentation warns against using these.
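To make the “buffer reuse” question concrete, here is a minimal sketch of one way I could imagine it working; this is my guess at the pattern, not something the docs confirm. A fixed pool of rollout tensors is allocated once with share_memory_(), and only integer slot indices travel through the queues, so the pickling cost per rollout is tiny. The names NUM_BUFFERS, ROLLOUT_LEN, OBS_DIM, actor and learner are all hypothetical, and the "environment" is just a constant fill.

```python
import torch
import torch.multiprocessing as mp

NUM_BUFFERS = 4   # hypothetical size of the reusable buffer pool
ROLLOUT_LEN = 8   # hypothetical number of steps per rollout
OBS_DIM = 3       # hypothetical observation size


def make_buffers():
    # Allocate the pool once; share_memory_() moves each tensor's storage
    # into shared memory so other processes can write into it directly.
    return [torch.zeros(ROLLOUT_LEN, OBS_DIM).share_memory_()
            for _ in range(NUM_BUFFERS)]


def actor(actor_id, buffers, free_queue, full_queue, num_rollouts):
    for _ in range(num_rollouts):
        idx = free_queue.get()  # wait for a slot the learner has released
        # Stand-in for environment interaction: fill the slot in place.
        buffers[idx].copy_(torch.full((ROLLOUT_LEN, OBS_DIM), float(actor_id)))
        full_queue.put(idx)     # only a small integer goes through the queue


def learner(buffers, free_queue, full_queue, num_rollouts):
    total = 0.0
    for _ in range(num_rollouts):
        idx = full_queue.get()
        batch = buffers[idx].clone()  # copy out (or move to GPU) first
        free_queue.put(idx)           # then hand the slot back for reuse
        total += batch.sum().item()   # stand-in for a training step
    return total


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    buffers = make_buffers()
    free_queue, full_queue = ctx.SimpleQueue(), ctx.SimpleQueue()
    for i in range(NUM_BUFFERS):
        free_queue.put(i)
    procs = [ctx.Process(target=actor,
                         args=(a, buffers, free_queue, full_queue, 5))
             for a in (1, 2)]
    for p in procs:
        p.start()
    print(learner(buffers, free_queue, full_queue, 10))
    for p in procs:
        p.join()
```

In this sketch the free/full queues act as the only synchronization: an actor never writes to a slot the learner has not released. Whether this is what the docs mean by reusing buffers is exactly what I am unsure about.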

PyTorch promotes torch.distributed in favor of torch.multiprocessing. But most examples show how to gather/share gradient information or pre-load a large static dataset, and they stress inter-process communication in gather/map-reduce scenarios, focusing on multiple machines.

It seems the above RL setup on a single (large) machine needs something slightly more custom. Are there open-source PyTorch code examples for the following use case: actors gather observations using CPU-bound computation, but also run forward passes on (large) NNs to compute actions; these large NNs are updated asynchronously by the learner (e.g. using one GPU).

I have a basic implementation of this (for a single machine), but it could be much improved with proper PyTorch advice. I use model.share_memory() for CUDA tensors and pass references to these to the actors; I maintain a CPU-based replay buffer that is populated by a CPU worker reading from the queue.
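For reference, here is a stripped-down sketch of an alternative weight-sharing scheme I am considering, since as far as I can tell from the docs share_memory() is a no-op for CUDA tensors (they are shared through CUDA IPC when sent to another process instead). The idea: actors hold a CPU copy of the policy in shared memory, the learner trains its own copy on the GPU and periodically copies its weights into the shared one. The functions make_policy, publish_weights, act and learner_step, the tiny linear policy and the MSE loss are all hypothetical stand-ins, and the demo runs in a single process to keep it short.

```python
import torch
import torch.nn as nn


def make_policy():
    # Hypothetical tiny policy: 4-dim observation -> 2 action logits.
    return nn.Linear(4, 2)


def publish_weights(learner_model, actor_model):
    # load_state_dict copies values in place, so actor_model's parameters
    # keep the shared-memory storage that the actors already hold.
    actor_model.load_state_dict(learner_model.state_dict())


@torch.no_grad()
def act(actor_model, obs):
    # Actor-side forward pass on CPU; greedy action for illustration only.
    return actor_model(obs).argmax(dim=-1)


def learner_step(learner_model, optimizer, obs, targets):
    # Stand-in loss; a real learner would use the IMPALA or Ape-X objective.
    loss = nn.functional.mse_loss(learner_model(obs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    actor_model = make_policy()
    actor_model.share_memory()  # CPU parameters now live in shared memory
    learner_model = make_policy().to(device)
    optimizer = torch.optim.SGD(learner_model.parameters(), lr=0.1)
    publish_weights(learner_model, actor_model)

    obs = torch.randn(16, 4)
    targets = torch.zeros(16, 2, device=device)
    learner_step(learner_model, optimizer, obs.to(device), targets)
    publish_weights(learner_model, actor_model)  # actors see new weights
    print(act(actor_model, obs[:3]))
```

One thing this sketch does not handle is locking: an actor could run a forward pass while publish_weights is halfway through copying. I assume that is tolerable for these algorithms, but I would like to know whether that assumption is reasonable.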

Here is a question that raises some of the same concerns, but I can’t see the resulting implementation, so I can’t use it as an actual example: Multiprocessing CUDA memory