Hi,
I’m having trouble with multiple processes working on the same GPU. I wrote minimal error-reproducing example.
I ran the example code successfully on my local machine, using CUDA 10.2 and pytorch 1.2.0.
While this works just fine, it fails to run on a cluster with CUDA 10.1 and pytorch 1.2.0.
Does anybody know why or how to overcome this? Thanks a ton.
CODE EXAMPLE
import torch.multiprocessing as _mp
import torch
import os
import time
import numpy as np
mp = _mp.get_context('spawn')
class Process(mp.Process):
def __init__(self, id):
super().__init__()
print("Init Process")
self.id = id
return
def run(self):
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
for i in range(3):
with torch.cuda.device(0):
x = torch.Tensor(10).to(0)
x.to('cpu')
del x
time.sleep(np.random.random())
if __name__ == "__main__":
num_processes = 2
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
processes = [Process(i) for i in range(num_processes)]
[p.start() for p in processes]
[p.join() for p in processes]
ERROR
Process Process-2:
Traceback (most recent call last):
File "/cluster/home/marksm/software/anaconda/envs/test/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/cluster/home/marksm/mp_demonstration.py", line 20, in run
x = torch.Tensor(10).to(0)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable