Failed to create network model in a child process (CUDA initialization error)


I’ve encountered a very weird problem today, and managed to narrow it down to something quite basic. The code below:

from torchvision.models.inception import inception_v3 as get_model
from torch import multiprocessing
import os

def check_model():
    print(os.getpid(), 'check_model start')
    net = get_model()
    print(os.getpid(), 'number of parameters', len(list(net.parameters())))
    # Move the model to the GPU; this is the call that raises in the child process
    net.cuda(device=0)
    print(os.getpid(), 'device', next(net.parameters()).device)
    print(os.getpid(), 'check_model end')

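def main():
    print(os.getpid(), 'main start')

    # First call: runs in the parent process
    check_model()

    # Second call: runs in a child process
    p = multiprocessing.Process(target=check_model)
    p.start()
    p.join()

    print(os.getpid(), 'main end with exitcode', p.exitcode)

if __name__ == '__main__':
    main()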

It fails on the second call to check_model, which runs in a new process (multiprocessing.Process). Please see the output below:

13232 main start
13232 check_model start
13232 number of parameters 292
13232 device cuda:0
13232 check_model end
13242 check_model start
13242 number of parameters 292
Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/", line 258, in _bootstrap
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "src/", line 10, in check_model
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/", line 176, in _apply
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/", line 176, in _apply
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/", line 182, in _apply = fn(
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/", line 249, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error (3): initialization error
13232 main end with exitcode 1

If I replace the call to multiprocessing.Process with a regular call to the check_model function, the problem disappears. It also disappears if I remove the line net.cuda(device=0). However, I need both of these for my code…
Why is this happening? I’ve tried other built-in models (VGG, ResNet) and get the same error.
I’ve also tried calling check_model twice, both times through a child process (multiprocessing.Process), and it worked. It seems to fail only when the parent process has already created the net and a child process then creates another and tries to transfer it to the GPU.

Any advice will be appreciated, thanks.

Update: I only get the error on a Linux (Ubuntu 16.04) machine, not on Windows (10). I’m connected to the Ubuntu machine via ssh.
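
One thing that may differ between the two machines is the default multiprocessing start method: as far as I know, Windows only supports "spawn", while Linux on Python 3.6 defaults to "fork". A minimal sketch to check this on each machine (default_start_method is just a hypothetical helper name, and it uses the standard library module, which torch.multiprocessing wraps):

```python
import multiprocessing
import sys


def default_start_method():
    # Returns the start method this interpreter uses when none is set
    # explicitly: 'fork', 'spawn' or 'forkserver'.
    return multiprocessing.get_start_method()


if __name__ == '__main__':
    print(sys.platform, default_start_method())
```

If the two machines report different values, that would be consistent with the error showing up only on Ubuntu.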

Same problem here.
Adding torch.multiprocessing.set_start_method("spawn") might solve this problem, but for me it brings others…
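
For reference, a minimal sketch of the "spawn" approach that does not change the global start method. It assumes the failure comes from the child being forked after the parent has already initialized CUDA, which is my reading of the output above rather than something confirmed in this thread. The sketch uses the standard library multiprocessing so it runs without a GPU; torch.multiprocessing should expose the same get_context call. check_model here is only a stand-in for the function in the original post, and run_in_spawned_child is a hypothetical helper name.

```python
import multiprocessing as mp
import os


def check_model():
    # Stand-in for the original check_model; in the real script this is
    # where the model is built and net.cuda(device=0) is called.
    print(os.getpid(), 'check_model running')


def run_in_spawned_child(target):
    # A 'spawn' context starts a fresh interpreter for the child instead of
    # forking the parent, so the child does not inherit the parent's state.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=target)
    p.start()
    p.join()
    return p.exitcode


if __name__ == '__main__':
    check_model()  # first call, in the parent process
    print(os.getpid(), 'child exitcode', run_in_spawned_child(check_model))
```

With "spawn" the child re-imports the main module, so the target must be a top-level function and the entry point has to stay under the if __name__ == '__main__' guard. That requirement may be one source of the other problems mentioned above.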