Failed to create network model in a child process (CUDA initialization error)

Hi,

I’ve encountered a very weird problem today, and managed to narrow it down to something quite basic. The code below:

from torchvision.models.inception import inception_v3 as get_model
from torch import multiprocessing
import os

def check_model():
    print(os.getpid(), 'check_model start')
    net = get_model()
    print(os.getpid(), 'number of parameters', len(list(net.parameters())))
    net.cuda(device=0)
    print(os.getpid(), 'device', next(net.parameters()).device)
    print(os.getpid(), 'check_model end')

def main():
    print(os.getpid(), 'main start')
    check_model()

    p = multiprocessing.Process(target=check_model)
    p.start()
    p.join()

    print(os.getpid(), 'main end with exitcode', p.exitcode)

if __name__ == '__main__':
    main()

Fails on the 2nd call to check_model, which runs in a new process (multiprocessing.Process). Please see the output below:

13232 main start
13232 check_model start
13232 number of parameters 292
13232 device cuda:0
13232 check_model end
13242 check_model start
13242 number of parameters 292
Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "src/tmp2.py", line 10, in check_model
    net.cuda(device=0)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 182, in _apply
    param.data = fn(param.data)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error (3): initialization error
13232 main end with exitcode 1

If I replace the call to multiprocessing.Process with a regular call to check_model, the problem disappears. It also disappears if I remove the line net.cuda(device=0). However, I need both of these for my code…
Why is this happening? I’ve tried different built-in models (VGG, ResNet) to no avail.
I’ve also tried calling check_model twice through child processes (multiprocessing.Process), and that worked. It seems to fail only when the parent process first creates a net and then a child process creates another one and tries to transfer it to the GPU.

Any advice will be appreciated, thanks.
Ran

Update: I only get the error on a Linux (Ubuntu 16.04) machine, not on Windows (10). I’m connected to the Ubuntu machine via ssh.
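My guess (an assumption, not verified in the PyTorch source) is that the platform difference comes from the default multiprocessing start method: Linux defaults to "fork", while Windows always uses "spawn". A quick way to check, using the stdlib multiprocessing module (torch.multiprocessing follows the same defaults):

```python
import multiprocessing as mp

# The default start method is platform-dependent: "fork" on Linux,
# "spawn" on Windows (and on macOS since Python 3.8).
print(mp.get_start_method())
print(mp.get_all_start_methods())
```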

Same problem here.
Adding torch.multiprocessing.set_start_method("spawn") might solve this problem, but for me it brings others…
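For reference, a minimal sketch of the spawn approach (an assumption based on the stdlib multiprocessing API, which torch.multiprocessing wraps): a forked child inherits the parent’s already-initialized CUDA context, which CUDA rejects, whereas a spawned child starts as a fresh interpreter. The sketch below uses stdlib multiprocessing with no CUDA so it runs anywhere; in the original script the same shape would apply with torch.multiprocessing.

```python
import multiprocessing as mp
import os

def worker(q):
    # In the original script, this is where net = get_model() and
    # net.cuda(device=0) would run safely: a spawned child starts
    # fresh and inherits no CUDA state from the parent.
    q.put(os.getpid())

def main():
    # Equivalent in effect to torch.multiprocessing.set_start_method("spawn"),
    # but scoped to a context object instead of mutating global state.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    child_pid = q.get()  # drain the queue before joining to avoid a deadlock
    p.join()
    assert child_pid != os.getpid()
    return p.exitcode

if __name__ == "__main__":
    print("child exitcode", main())
```

One caveat with set_start_method("spawn") globally: every Process target and its arguments must then be picklable, which is likely the source of the "other problems" mentioned above; using get_context("spawn") only where needed limits the blast radius.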