I am trying to use multiprocessing to move data batches to the GPU in a dedicated process. That process should take values (Python scalars or NumPy arrays) from an input queue, convert them into PyTorch CUDA tensors, and put the results on an output queue. The main training process would then get ready-to-use CUDA tensors from the output queue without any further processing. My goal is to save the time spent moving tensors to the GPU, which can take up to 30% of the total time in a project I am working on, depending on model size.
My experimental code looks like this:
import torch
import torch.multiprocessing as mp
import time
def worker(alert, q1, q2, dev):
    while True:
        alert.wait()
        alert.clear()
        # Convert the queued value to a tensor on the target device
        # and hand it to the output queue.
        q2.put(torch.tensor(q1.get(), dtype=torch.float32, device=dev, requires_grad=False))

if __name__ == '__main__':
    dev = 'cuda'  # 'cpu' or 'cuda'
    alert = mp.Event()
    q1 = mp.Queue()
    q2 = mp.Queue()
    p = mp.Process(target=worker, args=(alert, q1, q2, dev), daemon=True)
    p.start()
    for i in range(3):
        q1.put(i)
        alert.set()
    time.sleep(1)  # give the worker time to drain the input queue
    for i in range(3):
        print(q2.get())
But I got this error:
THCudaCheck FAIL file=c:\programdata\miniconda3\conda-bld\pytorch_1524546371102\work\torch\csrc\generic\StorageSharing.cpp line=253 error=71 : operation not supported
Traceback (most recent call last):
File "C:\Users\airium\Anaconda3\lib\multiprocessing\queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\Users\airium\Anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\Users\airium\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 108, in reduce_storage
metadata = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at c:\programdata\miniconda3\conda-bld\pytorch_1524546371102\work\torch\csrc\generic\StorageSharing.cpp:253
Judging from the traceback, the failure happens while the queue pickles the CUDA tensor (in storage._share_cuda_()). And if I change dev = 'cuda' to dev = 'cpu', it works:
tensor(0.)
tensor(1.)
tensor(2.)
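In case it matters, the fallback I have been considering is to keep the worker entirely on the CPU side and do the device transfer in the main process, since CPU tensors apparently go through the queue fine. This is only a sketch (the name cpu_worker is mine, and I have not measured it), and it puts the host-to-device copy back on the main process, which is exactly the cost I wanted to hide:

import torch
import torch.multiprocessing as mp

def cpu_worker(q1, q2):
    while True:
        # Build the tensor on the CPU only; unlike the CUDA tensor above,
        # a CPU tensor can be pickled through the queue without error.
        q2.put(torch.tensor(q1.get(), dtype=torch.float32))

if __name__ == '__main__':
    q1, q2 = mp.Queue(), mp.Queue()
    p = mp.Process(target=cpu_worker, args=(q1, q2), daemon=True)
    p.start()
    for i in range(3):
        q1.put(i)
    for _ in range(3):
        # The GPU transfer now happens here, in the main process,
        # so this variant defeats my original purpose.
        print(q2.get().to('cuda'))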
So I am seeking help here to tackle the original problem. My environment is Windows 10 1803, a GTX 1070, CUDA 9.1, Python 3.6 + PyTorch 0.4.0, running in conda; I also tried a native pip environment and got the same result. I want to figure out whether I am simply using PyTorch's multiprocessing incorrectly, or whether this operation is literally "not supported" on the GPU. I suspect the problem may be related to multiprocessing's start method, but I am not very familiar with that, and only spawn is available on Windows. Thanks in advance.