Hi, I have some code that was working with PyTorch a couple of releases ago.
But with the latest pip version (stable, Linux, CUDA 10.0, Python 3.7) I get an error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
But I'm not using multiprocessing, or DataParallel either.
The extent of my ‘parallelization’ is the following…
import numpy as np

def worker_init(worker_id):
    """
    Used with the PyTorch DataLoader so that each worker can grab random bits
    of files or synthesize random input data on the fly.
    Without this you get the same data every epoch.
    """
    # NOTE: this implementation prevents strict reproducibility, since each
    # worker reseeds itself from OS entropy.
    np.random.seed()
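(As an aside: if I ever do need strict reproducibility, my understanding is that each worker can instead be seeded from PyTorch's per-worker seed. A sketch, where worker_init_reproducible is a hypothetical replacement for the function above:)

import numpy as np
import torch

def worker_init_reproducible(worker_id):
    # torch.initial_seed() already differs per worker (base_seed + worker_id),
    # so deriving NumPy's seed from it keeps workers distinct but repeatable.
    np.random.seed(torch.initial_seed() % 2**32)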
Then in my training code…
dataloader = DataLoader(my_dataset, ..., num_workers=10, worker_init_fn=worker_init)
and the error occurs at the following line in my code:
for x, y in dataloader:
The error is:
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
…more lines of error message, ending with…
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I was not aware I was re-initializing CUDA.
As a possible fix anyway, I tried adding the lines of code that are recommended when one is using multiprocessing:
from torch.multiprocessing import Pool, Process, set_start_method, cpu_count

try:
    set_start_method('spawn')
except RuntimeError:
    pass
…but these have no effect.
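(For completeness: my understanding is that the start method can also be requested per-DataLoader rather than globally, assuming the installed PyTorch is new enough to have the multiprocessing_context argument. A sketch of that variant, which I haven't confirmed helps here:)

dataloader = DataLoader(my_dataset, ...,
                        num_workers=10,
                        worker_init_fn=worker_init,
                        multiprocessing_context='spawn')  # spawn workers instead of forking them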
As per this thread, my dataloader loads a dataset my_dataset
which is purely numpy arrays on the CPU, and only moves data to the GPU one line after where the error occurs…
for x, y in dataloader:                        # << error occurs here
    x_cuda, y_cuda = x.to(device), y.to(device)
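(In case it matters, here's a stripped-down stand-in for my_dataset. The names and shapes here are made up, but the essential point holds: __getitem__ only ever touches CPU-side numpy arrays, never CUDA:)

import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    """Simplified stand-in: the real one grabs random bits of files or
    synthesizes random input on the fly, but it is all numpy on the CPU."""

    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # No torch.cuda calls, no CUDA tensors anywhere in here.
        x = np.random.randn(1, 8192).astype(np.float32)
        y = np.random.randn(1, 8192).astype(np.float32)
        return x, y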
But unlike that aforementioned thread, I definitely want to keep multiple workers in my DataLoader! Even if I set num_workers=1, I still get the same error. (Presumably num_workers=0 would sidestep it, since that loads everything in the main process without forking, but that defeats the purpose.)
So… the other answers I've seen seem to assume you're explicitly using multiprocessing, but I'm not (apart from whatever the DataLoader itself does internally to run its workers).
Can anyone suggest how to fix this?
Thanks.
EDIT: Full trace follows…
Traceback (most recent call last):
  File "main_script.py", line 99, in <module>
    apex_opt=args.apex, target_type=args.target, lr_max=args.lrmax, in_checkpointname=args.checkpoint)
  File "mycode.py", line 267, in train
    y_size, parallel, logfilename, out_checkpointname, sr=sr, lr_max=lr_max)
  File "mycode.py", line 104, in train_loop
    for x, y in dataloader:
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 65, in default_collate
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 65, in <listcomp>
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/cuda/__init__.py", line 177, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

What strikes me about this trace is that the worker dies inside default_collate, at the torch.as_tensor(b) call: merely turning my numpy batch into tensors is apparently trying to initialize CUDA inside the forked worker.
P.S.- Setting pin_memory to either True or False yields the same error.
P.P.S.- One thing on my system that did change: I downgraded from CUDA 10.1 to 10.0. Is it possible that this error message is really indicating some sort of CUDA version mismatch resulting from (perhaps) incompletely removing CUDA 10.1?
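(If it's relevant, here's a quick sanity check I can paste the output of, showing which CUDA version my PyTorch build expects:)

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version this build was compiled against
print(torch.cuda.is_available())  # whether CUDA can initialize at all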