Runtime error while multiprocessing

Hi,

I’m having trouble with multiple processes working on the same GPU, so I wrote a minimal error-reproducing example.
The example runs successfully on my local machine with CUDA 10.2 and PyTorch 1.2.0, but it fails on a cluster with CUDA 10.1 and PyTorch 1.2.0.

Does anybody know why or how to overcome this? Thanks a ton.

CODE EXAMPLE

import torch.multiprocessing as _mp
import torch
import os
import time
import numpy as np

mp = _mp.get_context('spawn')

class Process(mp.Process):
    def __init__(self, id):
        super().__init__()
        print("Init Process")
        self.id = id

    def run(self):
        os.environ['CUDA_VISIBLE_DEVICES'] = '0'
        for i in range(3):
            with torch.cuda.device(0):
                x = torch.Tensor(10).to(0)  # allocate an uninitialized tensor on GPU 0
                x = x.to('cpu')             # .to() returns a new tensor, so assign the result
                del x
            time.sleep(np.random.random())  # sleep up to 1 s between iterations

if __name__ == "__main__":
    num_processes = 2
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    processes = [Process(i) for i in range(num_processes)]
    [p.start() for p in processes]
    [p.join() for p in processes]

ERROR

Process Process-2:
Traceback (most recent call last):
  File "/cluster/home/marksm/software/anaconda/envs/test/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/cluster/home/marksm/mp_demonstration.py", line 20, in run
    x = torch.Tensor(10).to(0)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
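In case it helps with diagnosis: on shared clusters, “all CUDA-capable devices are busy or unavailable” is often caused by the GPU being set to an exclusive compute mode, which allows only one CUDA context on the device at a time, so the second spawned worker fails. A quick way to check, assuming `nvidia-smi` is available on the node (whether your cluster is actually configured this way is an assumption on my part):

```shell
# Query the compute mode of each GPU on the node.
# "Exclusive_Process" means only one process may hold a CUDA context
# per device, which would explain why the second worker fails while
# the same code runs fine on a local machine in "Default" mode.
nvidia-smi --query-gpu=index,name,compute_mode --format=csv
```

If the mode is `Exclusive_Process`, either ask the admins to set it to `Default`, give each process its own GPU, or use CUDA MPS on the node.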