Hi. I'm training a model with DDP on two P100 GPUs. Whenever I set num_workers > 0 for my val_dataloader, the validation step of epoch 0 crashes. My train_dataloader has num_workers=4 and the sanity validation check runs fine; the model also works fine when the validation num_workers=0. I have checked several similar issues, but none seem to match the one I'm facing.
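For context, this is roughly how my loaders and trainer are wired up. It is only a sketch: the datasets, tensor shapes, batch size, and MyDataModule are placeholders standing in for my real code, and my actual LightningModule is omitted.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets standing in for my real ones (shapes are made up)
train_ds = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
val_ds = TensorDataset(torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,)))

class MyDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        # num_workers=4 here has never caused a problem
        return DataLoader(train_ds, batch_size=8, shuffle=True, num_workers=4)

    def val_dataloader(self):
        # any num_workers > 0 here crashes at the first real validation run;
        # num_workers=0 works
        return DataLoader(val_ds, batch_size=8, shuffle=False, num_workers=4)

trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=2)
trainer.fit(model, MyDataModule())  # model is my LightningModule (omitted here)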
My PyTorch wheel is built against CUDA 10.2, but the machine I am running on has CUDA 11.4 installed. Could this mismatch be the source of the error? pytorch-lightning version = 1.4.2, torch version = '1.9.0+cu102'.
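In case it helps, this is how I am reading those versions off; the commented values are what I see on this machine.

import torch

print(torch.__version__)          # 1.9.0+cu102
print(torch.version.cuda)         # 10.2 (the toolkit the wheel was built against)
print(torch.cuda.is_available())  # True
print(torch.cuda.device_count())  # 2

Please find the exact error output below.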
Validation sanity check: 0it [00:00, ?it/s]
/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:105: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Validation sanity check:   0%|          | 0/1 [00:00<?, ?it/s]
/home/usr/pytorch/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/usr/pytorch/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Global seed set to 42
Global seed set to 42
Epoch 0:  80%|████████████████████████████████████████████     | 4/5 [00:14<00:02, 2.80s/it, loss=4.33, v_num=d09e]
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2b5f7135ca22 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10d7e (0x2b5f710ecd7e in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x2b5f710ee027 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x2b5f713465a4 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x2b5f1a569e1a in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2b4b41756a22 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10d7e (0x2b4b414e6d7e in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x2b4b414e8027 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x2b4b417405a4 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x2b4aea963e1a in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
The two DDP processes print their tracebacks interleaved, so I have de-interleaved them here. This is the rank whose worker is pid 3404; the other rank's traceback is identical except that its worker is pid 3407.

Traceback (most recent call last):
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3404) is killed by signal: Aborted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/usr/mymodel/run.py", line 22, in <module>
    main()
  File "/home/usr/mymodel/run.py", line 18, in main
    return train(CFG)
  File "/scratch/usr/mymodel/src/train.py", line 110, in train
    trainer.fit(model,dm)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 112, in run
    self.on_advance_end()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 177, in on_advance_end
    self._run_validation()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 256, in _run_validation
    self.val_loop.run()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 93, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3404) exited unexpectedly

The second rank ends with the corresponding errors for its own worker:

RuntimeError: DataLoader worker (pid 3407) is killed by signal: Aborted.
RuntimeError: DataLoader worker (pid(s) 3407) exited unexpectedly
@ptrblck I'd really appreciate it if you could take a look. Thank you!