I am facing a runtime error when running training.py from chapter 11 of the dlwpt (Deep Learning with PyTorch) book. Here is the full log and traceback:
2020-11-10 04:42:43,159 INFO pid:14780 __main__:082:initModel Using CUDA; 8 devices.
2020-11-10 04:42:44,775 INFO pid:14780 __main__:141:main Starting LunaTrainingApp, Namespace(batch_size=4, comment='dwlpt', epochs=1, num_workers=32, tb_prefix='p2ch11')
2020-11-10 04:42:47,521 INFO pid:14780 dsets:182:__init__ <dsets.LunaDataset object at 0x7fbd5d2b40a0>: 198764 training samples
2020-11-10 04:42:47,534 INFO pid:14780 dsets:182:__init__ <dsets.LunaDataset object at 0x7fbcef6f9d90>: 22085 validation samples
2020-11-10 04:42:47,534 INFO pid:14780 __main__:148:main Epoch 1 of 1, 6212/691 batches of size 4*8
2020-11-10 04:42:47,535 WARNING pid:14780 util:219:enumerateWithEstimate E1 Training ----/6212, starting
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 14850) is killed by signal: Killed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/pytorch_1/dlwpt-code-master/p2ch11/training.py", line 390, in <module>
    LunaTrainingApp().main()
  File "/home/user/pytorch_1/dlwpt-code-master/p2ch11/training.py", line 157, in main
    trnMetrics_t = self.doTraining(epoch_ndx, train_dl)
  File "/home/user/pytorch_1/dlwpt-code-master/p2ch11/training.py", line 181, in doTraining
    for batch_ndx, batch_tup in batch_iter:
  File "/home/user/pytorch_1/dlwpt-code-master/util/util.py", line 224, in enumerateWithEstimate
    for (current_ndx, item) in enumerate(iter):
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 931, in _get_data
    success, data = self._try_get_data()
  File "/home/user/anaconda3/envs/pytorch_updated/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 14850) exited unexpectedly
Process finished with exit code 1
I tried lowering the batch size and increasing num_workers up to 32 (as shown above), but the error still doesn't go away. The default values should work, considering that I have 8 GPUs. What is the issue here?
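For reference, this is roughly how I understand batch_size and num_workers to flow into the DataLoader in the book's p2ch11/training.py. It is a simplified sketch from memory (the standalone init_train_dl helper below is my paraphrase of the app's method), not the exact code from the repo:

import torch
from torch.utils.data import DataLoader

from dsets import LunaDataset  # dataset class from the book's p2ch11 code


def init_train_dl(cli_args, use_cuda):
    # Build the training dataset; every 10th sample is held out for validation.
    train_ds = LunaDataset(val_stride=10, isValSet_bool=False)

    # The per-GPU batch size is multiplied by the number of visible GPUs,
    # which is why the log reports "batches of size 4*8".
    batch_size = cli_args.batch_size
    if use_cuda:
        batch_size *= torch.cuda.device_count()

    # Each of the num_workers DataLoader workers is a separate process that
    # loads CT data, so host RAM use grows with the worker count.
    train_dl = DataLoader(
        train_ds,
        batch_size=batch_size,
        num_workers=cli_args.num_workers,
        pin_memory=use_cuda,
    )
    return train_dl

With batch_size=4 and 8 GPUs that gives an effective batch of 32 per step, which matches the "6212/691 batches of size 4*8" line in the log (198764 / 32 ≈ 6212 training batches), and with num_workers=32 there are 32 separate worker processes feeding those batches.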