Training while doing inference in parallel on the GPU

Hi, I have an RL agent that I'd like to both train and run inference with at the same time on the GPU. However, using CUDA with torch.multiprocessing doesn't seem to work well, and it gives this error when the other process starts up:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
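Roughly what I'm doing when this happens, as a stripped-down sketch (the Linear model and the worker here are just placeholders for my actual agent and inference loop):

import torch
import torch.nn as nn
import torch.multiprocessing as mp

def inference_worker(model):
    # child process: the first CUDA call after the fork is what raises the error above
    with torch.no_grad():
        model(torch.randn(1, 4, device="cuda"))

if __name__ == "__main__":
    model = nn.Linear(4, 2).cuda()  # CUDA gets initialized in the parent here
    # default start method ('fork' on Linux), so the child inherits the CUDA context
    p = mp.Process(target=inference_worker, args=(model,))
    p.start()
    p.join()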

However, using the ‘spawn’ start method gives me another issue, because this method tries to serialize everything using pickle, and that produces another mysterious error upon zero_grad(). I have already detached everything in my replay buffer…

  File "env.py", line 60, in <module>
    main()
  File "env.py", line 55, in main
    process_two.start()
  File "/home/`/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/zero/anaconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/zero/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/zero/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/zero/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/zero/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/zero/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/zero/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 136, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries.  If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
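From the error message, it sounds like something that gets pickled is still a non-leaf tensor with requires_grad set. A standalone toy example like the following (not my actual code) seems to hit the same check in reduce_tensor:

import torch
import torch.multiprocessing as mp

def worker(t):
    print(t.shape)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    w = torch.randn(2, 2, requires_grad=True)
    t = w * 2  # non-leaf tensor that still requires grad
    p = mp.Process(target=worker, args=(t,))
    p.start()  # 'spawn' pickles the args here, which trips the check
    p.join()

But as far as I can tell everything I store in my buffer has been detach()ed, so I don't see where such a tensor would be coming from.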

So, my question is: is there any way to do this asynchronous training/inference setup in parallel on a GPU? The default start method of torch.multiprocessing seems to work well on my multi-core CPU, but I'd like to move to the GPU for its optimized matmul operations…
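For context, this is roughly the shape of the CPU-only version that works for me with the default start method (heavily simplified; the network and the two loops are placeholders for my real code):

import torch
import torch.nn as nn
import torch.multiprocessing as mp

class Policy(nn.Module):
    # stand-in for my actual network
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(4, 2)

    def forward(self, x):
        return self.net(x)

def train(model):
    # stand-in for my training process
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(100):
        loss = model(torch.randn(8, 4)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def act(model):
    # stand-in for my inference process
    with torch.no_grad():
        for _ in range(100):
            model(torch.randn(1, 4))

if __name__ == "__main__":
    model = Policy()
    model.share_memory()  # parameters in shared CPU memory, so both processes see updates
    procs = [mp.Process(target=train, args=(model,)),
             mp.Process(target=act, args=(model,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Ideally I'd like the same structure, but with the model living on the GPU.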