Using torch.multiprocessing.Queue with CUDA tensors

I am trying to use multiprocessing to move data batches to the GPU in a dedicated process. This process should get values from an input queue of Python values or NumPy arrays, transform them into PyTorch CUDA tensors, and put the results into an output queue. The main training process would then get ready-to-use CUDA tensors from the output queue without any further processing. My goal is to save the time spent moving tensors to the GPU, which takes up to 30% of total time in a project I am working on, depending on model size.

My experimental code looks like this:

import torch
import torch.multiprocessing as mp
import time

def worker(alert, q1, q2, dev):
    while True:
        alert.wait()   # block until the main process signals that data is ready
        alert.clear()
        # move the next value from q1 onto dev and hand the tensor to the main process
        q2.put(torch.tensor(q1.get(), dtype=torch.float32, device=dev, requires_grad=False))

if __name__ == '__main__':

    dev = 'cuda' # 'cpu' or 'cuda'

    alert = mp.Event()
    q1 = mp.Queue()
    q2 = mp.Queue()
    p = mp.Process(target=worker, args=(alert, q1, q2, dev), daemon=True)
    p.start()

    for i in range(3):
        q1.put(i)
        alert.set()
        time.sleep(1)

    for i in range(3):
        print(q2.get())

But I got this error:

THCudaCheck FAIL file=c:\programdata\miniconda3\conda-bld\pytorch_1524546371102\work\torch\csrc\generic\StorageSharing.cpp line=253 error=71 : operation not supported
Traceback (most recent call last):
  File "C:\Users\airium\Anaconda3\lib\multiprocessing\queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "C:\Users\airium\Anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\Users\airium\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 108, in reduce_storage
    metadata = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at c:\programdata\miniconda3\conda-bld\pytorch_1524546371102\work\torch\csrc\generic\StorageSharing.cpp:253

However, if I change dev = 'cuda' to dev = 'cpu', it works:

tensor(0.)
tensor(1.)
tensor(2.)

So I am seeking help to tackle this problem. My environment is Windows 10 1803, GTX 1070, CUDA 9.1, Python 3.6 + PyTorch 0.4.0, running in conda. I also tried a native pip environment but got the same result. I want to figure out whether I am using PyTorch's multiprocessing incorrectly, or whether this operation is literally "not supported" on the GPU. I suspect the problem may be related to multiprocessing's start method, but I am not very familiar with it, and only spawn is available on Windows. Thanks in advance.

It's actually not supported by CUDA on Windows. CUDA IPC, which PyTorch relies on to share CUDA tensors between processes (the storage._share_cuda_() call in your traceback), is unavailable there, so only CPU tensors can be sent through a torch.multiprocessing queue. See the details in the Windows notes:
https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations
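
As a workaround, a common pattern is to keep the queue traffic on CPU tensors, which can be shared between processes on Windows, and do the host-to-device copy in the main process using pinned memory and a non-blocking transfer. Here is a minimal sketch (the names q_in, q_out and the None sentinel are mine, not from your code):

import torch
import torch.multiprocessing as mp

def worker(q_in, q_out):
    # Build CPU tensors only: CPU storage is moved into shared memory
    # when a tensor passes through a torch.multiprocessing queue.
    while True:
        item = q_in.get()
        if item is None:  # sentinel: shut the worker down
            break
        q_out.put(torch.tensor(item, dtype=torch.float32))

if __name__ == '__main__':

    q_in = mp.Queue()
    q_out = mp.Queue()
    p = mp.Process(target=worker, args=(q_in, q_out), daemon=True)
    p.start()

    for i in range(3):
        q_in.put(i)

    for _ in range(3):
        t = q_out.get()
        # Pin the host buffer, then copy asynchronously to the GPU;
        # non_blocking=True only overlaps the copy when the source is pinned.
        print(t.pin_memory().to('cuda', non_blocking=True))

    q_in.put(None)
    p.join()

This is essentially what DataLoader does with pin_memory=True, so if your batches come from a Dataset you may get the same effect without hand-rolling the worker process.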