Multiprocessing with CUDA error

I want to use torch.multiprocessing to speed up my loop, but I am running into some errors.
I don't fully understand how CUDA memory is shared with subprocesses.
Could anyone give an explanation?

import time

import torch
from torch.multiprocessing import Pool

def use_gpu():
    # Each pool worker allocates a few large tensors on GPU 3
    t = []
    for i in range(5):
        time.sleep(1)
        a = torch.randn(1000, 1000).cuda(3)
        t.append(a)
    return t

if __name__ == "__main__":
    # torch.cuda.set_device(3)
    pool = Pool()
    result = []
    a = time.time()
    for i in range(10):
        result.append(pool.apply_async(use_gpu))
    pool.close()
    pool.join()
    print("cost time :", time.time() - a)

This snippet works fine as written.
However, if I uncomment torch.cuda.set_device(3) to select the GPU, I get the following errors:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error
THCudaCheck FAIL file=/pytorch/torch/lib/THC/THCGeneral.c line=74 error=3 : initialization error

What causes this, and how can I solve it?
Thanks in advance.

When using multiprocessing with CUDA, as mentioned here, you have to use a start method other than fork. For example:

import torch
torch.multiprocessing.set_start_method('spawn')
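
Putting the pieces together, a minimal sketch of how the original example might look with the spawn start method (assuming GPU 3 is still the target device):

import time

import torch
from torch.multiprocessing import Pool, set_start_method

def use_gpu():
    # Runs in a freshly spawned process, so CUDA is initialized here
    # instead of being inherited from the parent
    torch.cuda.set_device(3)
    t = []
    for i in range(5):
        time.sleep(1)
        t.append(torch.randn(1000, 1000).cuda())
    return t

if __name__ == "__main__":
    # 'spawn' has to be set before the Pool is created and before any CUDA call
    set_start_method('spawn')
    pool = Pool()
    result = [pool.apply_async(use_gpu) for _ in range(10)]
    pool.close()
    pool.join()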

Thanks a lot for the help so far. After adding torch.multiprocessing.set_start_method('spawn'), a new problem arises:

Traceback (most recent call last):
  File "test9.py", line 45, in <module>
    torch.multiprocessing.set_start_method('spawn')
  File "/usr/local/lib/python3.5/multiprocessing/context.py", line 231, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Is there a way around this?

This usually happens if you have not properly wrapped your main code in an if __name__ == '__main__': construct. Another possible issue is that your project has multiple files that call set_start_method at module level, so the context gets set again when those files are imported.

  • One option is to have only a single entry point in your project that is properly wrapped in that construct,
  • Or you can call set_start_method with the argument force=True:
import torch
torch.multiprocessing.set_start_method('spawn', force=True)
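
If you would rather not touch the global start method at all, another option is to create a private spawn context just for the pool via get_context, which torch.multiprocessing inherits from the standard multiprocessing module. A minimal sketch:

import torch
import torch.multiprocessing as mp

def use_gpu():
    return torch.randn(10, 10).cuda()

if __name__ == "__main__":
    # A private 'spawn' context leaves the global start method alone,
    # so "context has already been set" can no longer be raised
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as pool:
        results = [pool.apply_async(use_gpu) for _ in range(4)]
        tensors = [r.get() for r in results]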

Hope it helps. :blush:

I had a slightly different problem when training multiple models on multiple GPUs in parallel. This is the only thread I found helpful, so I hope it is alright to bring it up here.

The problem can be reproduced by modifying the example above based on the earlier suggestion. Here I want to run use_gpu on all of the devices in parallel.

import time
import torch
from torch.multiprocessing import Pool
torch.multiprocessing.set_start_method('spawn', force=True)


def use_gpu(ind):
    # Each worker allocates its tensors on the GPU index it receives
    t = []
    for i in range(5):
        time.sleep(1)
        a = torch.randn(1000, 1000).cuda(ind)
        t.append(a)
    return t

if __name__ == "__main__":
    # torch.cuda.set_device(3)
    pool = Pool()
    result = []
    a = time.time()
    # Launch one task per visible GPU
    for i in range(torch.cuda.device_count()):
        result.append(pool.apply_async(use_gpu, (i,)))
    pool.close()
    pool.join()
    print("cost time :", time.time() - a)

However, I can't figure out why I get the following error message:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable
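
In case it helps narrow things down, here is a variant I have been considering (both changes are guesses on my part, since I am not sure how CUDA tensors are shared between pool workers): bind each worker to its GPU with torch.cuda.set_device before allocating anything, and copy the results back to the CPU before returning them, so no CUDA memory has to travel through the pool's result queue.

import time
import torch
from torch.multiprocessing import Pool
torch.multiprocessing.set_start_method('spawn', force=True)


def use_gpu(ind):
    # Bind this worker to a single GPU before any allocation happens
    torch.cuda.set_device(ind)
    t = []
    for i in range(5):
        time.sleep(1)
        # .cpu() copies the result off the GPU, so only ordinary memory
        # is pickled back to the parent process
        t.append(torch.randn(1000, 1000).cuda().cpu())
    return t

if __name__ == "__main__":
    n = torch.cuda.device_count()
    pool = Pool(n)
    result = [pool.apply_async(use_gpu, (i,)) for i in range(n)]
    pool.close()
    pool.join()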

Thanks for the help.

I have the same problem. How did you fix it?

Same here. Have you solved this problem?

Same here. Have you fixed it yet?