A call to torch.cuda.is_available makes an unrelated multi-processing computation crash?

This piece of code:

import torch

def create_one(batch_size):
    return torch.ByteTensor(batch_size, 128, 128)

# torch.cuda.is_available()                                                                                                                                                                    

pool = torch.multiprocessing.Pool(torch.multiprocessing.cpu_count())

data = pool.map(create_one, torch.LongTensor(1000).fill_(100))

works fine, but if the cuda.is_available line is uncommented, it crashes with the following error:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/fleuret/misc/anaconda3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/home/fleuret/misc/anaconda3/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fleuret/misc/anaconda3/lib/python3.5/multiprocessing/pool.py", line 429, in _handle_results
    task = get()
  File "/home/fleuret/misc/anaconda3/lib/python3.5/multiprocessing/connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
  File "/home/fleuret/misc/anaconda3/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/fleuret/misc/anaconda3/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/fleuret/misc/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/fleuret/misc/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 160, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

Did I miss something? It works fine on a machine without CUDA.

When using multiprocessing with CUDA, it is important to use the spawn start method instead of the default fork method.
http://pytorch.org/docs/notes/multiprocessing.html#sharing-cuda-tensors

import torch.multiprocessing as multiprocessing
multiprocessing.set_start_method('spawn')

This is a restriction of CUDA/NVIDIA.
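
For reference, here is a minimal sketch of how the original example could be restructured around that advice (untested; it assumes the start method is set under a main guard before the pool is created, and it swaps the LongTensor argument for a plain Python list to keep the sketch simple):

import torch
import torch.multiprocessing as multiprocessing

def create_one(batch_size):
    # Workers only allocate CPU shared-memory tensors; CUDA is never touched here.
    return torch.ByteTensor(batch_size, 128, 128)

if __name__ == '__main__':
    # Must be called once, before any pool or worker process is created.
    multiprocessing.set_start_method('spawn')
    torch.cuda.is_available()
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    data = pool.map(create_one, [100] * 1000)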

So this should work as-is?

#!/usr/bin/env python                                                                                                                                                                          

import torch
import torch.multiprocessing as multiprocessing

if torch.cuda.is_available():
    multiprocessing.set_start_method('spawn')

def create_one(batch_size):
    return torch.ByteTensor(batch_size, 128, 128)

pool = torch.multiprocessing.Pool(torch.multiprocessing.cpu_count())

data = pool.map(create_one, torch.LongTensor(1000).fill_(100))

Because it still does not:

  File "/home/fleuret/misc/anaconda3/lib/python3.5/multiprocessing/context.py", line 231, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

My understanding of the reason for the error was incorrect.

According to https://github.com/pytorch/pytorch/issues/973, what is actually happening is that the multiprocessing Pool has a rotating set of workers, and the workers are being recycled before the transfer of the file descriptors (for the Tensor shared memory) from worker to main process has finished.

We’ll investigate this further on https://github.com/pytorch/pytorch/issues/973

I noticed that

pool = multiprocessing.Pool(multiprocessing.cpu_count(), maxtasksperchild=1)

does fix the error (with set_start_method removed from the code; I’ll look into that separately). However, having only one task per child before it’s reinitialized is probably not very efficient.
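
For completeness, this is roughly how that workaround slots into the original example (a sketch, untested, keeping the default fork start method and again using a plain Python list as the argument):

import torch
import torch.multiprocessing as multiprocessing

def create_one(batch_size):
    return torch.ByteTensor(batch_size, 128, 128)

if __name__ == '__main__':
    torch.cuda.is_available()
    # Each worker handles a single task and is then replaced, so it is not
    # recycled while a file descriptor transfer is still in flight.
    pool = multiprocessing.Pool(multiprocessing.cpu_count(), maxtasksperchild=1)
    data = pool.map(create_one, [100] * 1000)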

Looks like set_start_method did not work for me, but mp = mp.get_context('spawn') did. Hope that provides some help.
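
In case it helps, this is what that looks like applied to the example from this thread (a sketch, untested, with the argument list simplified to a plain Python list; get_context returns a separate context object, so it sidesteps the "context has already been set" error that a late set_start_method call raises):

import torch
import torch.multiprocessing

def create_one(batch_size):
    return torch.ByteTensor(batch_size, 128, 128)

if __name__ == '__main__':
    torch.cuda.is_available()
    # The context object carries its own Pool/Process classes bound to 'spawn'.
    mp = torch.multiprocessing.get_context('spawn')
    pool = mp.Pool(processes=torch.multiprocessing.cpu_count())
    data = pool.map(create_one, [100] * 1000)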

I’ve also tried using forkserver as a start method and found a non-negligible improvement in speed, though this is noted in the multiprocessing docs (I’m referring specifically to Python 3): https://docs.python.org/3/library/multiprocessing.html

Any news regarding this, people?

Hi!

Using @Michael_Petrochuk’s solution is enough:

import torch.multiprocessing
...
def test_worker_in_mpi():
    mp = torch.multiprocessing.get_context('forkserver')
    pool = mp.Pool(processes=1) # ,maxtasksperchild=1
    rez = pool.map(worker, params_iterator) 

And this was a pytest. At least for my toy example (inside the worker I create a 2-layer net and pass a tensor through it) it works without any main guard, maxtasksperchild=1, or set_start_method.

Thanks!

Most probably I am hitting the same issue here. I’m using Flask to serve a web service which uses multiprocessing to predict images. Then I got the following error.

Exception in thread Thread-19:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/content/flask-video-streaming/base_camera.py", line 97, in _thread
    for frame in frames_iterator:
  File "<ipython-input-22-98a25d1b92f5>", line 38, in frames
    predictor = DefaultPredictor(cfg)
  File "/content/detectron2_repo/detectron2/engine/defaults.py", line 163, in __init__
    self.model = build_model(self.cfg)
  File "/content/detectron2_repo/detectron2/modeling/meta_arch/build.py", line 19, in build_model
    return META_ARCH_REGISTRY.get(meta_arch)(cfg)
  File "/content/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 41, in __init__
    pixel_mean = torch.Tensor(cfg.MODEL.PIXEL_MEAN).to(self.device).view(num_channels, 1, 1)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 197, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:54

After some Google searching, I tried the following solutions but none of them worked. Does anyone have suggestions to help me move forward? Thanks in advance!

if __name__ == "__main__":
#    manager.run(host='0.0.0.0', threaded=True)
    print(torch.multiprocessing.get_start_method())
    torch.multiprocessing.set_start_method('spawn', force=True)
    #torch.multiprocessing = torch.multiprocessing.get_context('spawn')
    #torch.multiprocessing.set_start_method('spawn')
    #torch.multiprocessing.set_start_method('forkserver', force=True)
    app.run(host='0.0.0.0', threaded=False, processes=2)

So, I have the same exact error.

I am running this code from the GoodNews repo:

[jalal@goku GoodNews]$ python train.py --cnn_weight data/resnet152-b121ed2d.pth
DataLoader loading json file:  data/data_news.json
vocab size is  37200
DataLoader loading h5 file:  data/data_news_label.h5 /scratch2/goodnewsdata/data_news_image.h5
read 489229 images of size 3x256x256
max sequence length in data is 31
assigned 445433 images to split train
assigned 19376 images to split val
assigned 24420 images to split test
WARNING:tensorflow:From train.py:49: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch-nightly_1553749772122/work/aten/src/THC/THCGeneral.cpp line=51 error=3 : initialization error
Traceback (most recent call last):
  File "train.py", line 280, in <module>
    train(opt)
  File "train.py", line 81, in train
    cnn_model.cuda()
  File "/scratch/sjn-p3/anaconda/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 263, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/scratch/sjn-p3/anaconda/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 190, in _apply
    module._apply(fn)
  File "/scratch/sjn-p3/anaconda/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 196, in _apply
    param.data = fn(param.data)
  File "/scratch/sjn-p3/anaconda/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 263, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/scratch/sjn-p3/anaconda/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch-nightly_1553749772122/work/aten/src/THC/THCGeneral.cpp:51
Closing remaining open files:data/data_news_label.h5...done/scratch2/goodnewsdata/data_news_image.h5...done

These files have the keyword ‘multiprocessing’:

[jalal@goku GoodNews]$ rg 'multiprocessing'
dataloader.py
11:# import multiprocessing
12:# from multiprocessing import Process
13:# from multiprocessing.dummy import Pool as ThreadPool
14:# from pathos.multiprocessing import Pool

get_data/get_imgs_only/get_images.py
12:# from multiprocessing.dummy import Pool as ThreadPool

get_data/with_article_urls/get_data_with_urls.py
14:# from multiprocessing.dummy import Pool as ThreadPool

get_data/with_api/get_data_api.py
9:import multiprocessing
15:from multiprocessing.dummy import Pool as ThreadPool

However, I am not sure how to proceed. Could you please have a look at the files in the repo?

I have:

$ nvidia-smi
Fri Sep 18 23:25:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   24C    P8    19W / 250W |     55MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   27C    P8    12W / 250W |      2MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2637      G   /usr/bin/X                         39MiB |
|    0   N/A  N/A      2903      G   /usr/bin/gnome-shell               12MiB |
+-----------------------------------------------------------------------------+

and

$ python
Python 3.6.7 | packaged by conda-forge | (default, Nov  6 2019, 16:19:42) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.__version__
'1.0.0.dev20190328'

>>> torch.version.cuda
'8.0.61'
>>> torch.cuda.is_available()
False

>>> torch.backends.cudnn.enabled
True

$ cat /usr/local/cuda/version.txt 
CUDA Version 10.0.130

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

I am not sure how to fix the problem by looking at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver

More information is available here: https://stackoverflow.com/questions/63965122/fixing-torch-cuda-is-available-when-it-is-false