Hi,
I’ve encountered a very weird problem today, and managed to narrow it down to something quite basic. The code below:
from torchvision.models.inception import inception_v3 as get_model
from torch import multiprocessing
import os


def check_model():
    print(os.getpid(), 'check_model start')
    net = get_model()
    print(os.getpid(), 'number of parameters', len(list(net.parameters())))
    net.cuda(device=0)
    print(os.getpid(), 'device', next(net.parameters()).device)
    print(os.getpid(), 'check_model end')


def main():
    print(os.getpid(), 'main start')
    check_model()
    p = multiprocessing.Process(target=check_model)
    p.start()
    p.join()
    print(os.getpid(), 'main end with exitcode', p.exitcode)


if __name__ == '__main__':
    main()
fails on the second call to check_model, which runs in a new process (multiprocessing.Process). Please see the output below:
13232 main start
13232 check_model start
13232 number of parameters 292
13232 device cuda:0
13232 check_model end
13242 check_model start
13242 number of parameters 292
Process Process-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "src/tmp2.py", line 10, in check_model
    net.cuda(device=0)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 182, in _apply
    param.data = fn(param.data)
  File "/home/ubuntu/venvs/algo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error (3): initialization error
13232 main end with exitcode 1
If I replace the multiprocessing.Process call with a direct call to the check_model function, the problem disappears. It also disappears if I remove the line net.cuda(device=0). However, I need both of these for my code…
Why is this happening? I’ve tried different builtin models (VGG, ResNet) to no avail.
I’ve also tried calling check_model twice, each time through a child process (multiprocessing.Process), and it worked. It seems to fail only when the parent process has already created the net and a child process then creates another one and tries to transfer it to the GPU.
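The variant that worked can be sketched roughly as follows. This is a minimal illustration, not the exact script: run_in_child and report_pid are hypothetical names, report_pid is only a stand-in for check_model so that the snippet runs without torch or a GPU, and the fork start method is requested explicitly because that is the default for Python 3.6 on Linux (the setup shown in the traceback).

```python
import multiprocessing
import os


def run_in_child(target, *args):
    """Run target(*args) in a forked child process and return its exit code.

    Hypothetical helper, introduced only for this sketch.
    """
    # Assumption: a Unix-like OS where the 'fork' start method is available.
    ctx = multiprocessing.get_context('fork')
    p = ctx.Process(target=target, args=args)
    p.start()
    p.join()
    return p.exitcode


def report_pid(label):
    # Stand-in for check_model: in the real script this would build the
    # model and call net.cuda(device=0) inside the child process.
    print(os.getpid(), label)


def main():
    print(os.getpid(), 'main start')
    # The parent never runs the worker itself; both calls go through a child.
    for i in range(2):
        exitcode = run_in_child(report_pid, 'call %d' % i)
        print(os.getpid(), 'child finished with exitcode', exitcode)


if __name__ == '__main__':
    main()
```

The only structural difference from the failing script is that the parent process does not call the worker directly before starting a child.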
Any advice will be appreciated, thanks.
Ran