Validation crashes when num_workers > 0 with CUDA initialization error

Hi. I'm training a model using DDP on 2 P100 GPUs. I've noticed that when I set num_workers > 0 for my val_dataloader, the validation step in epoch 0 crashes. My train_dataloader has num_workers=4 and the validation sanity check runs fine. I have checked several similar issues, but none seem to match the one I'm facing. The model works fine when the validation num_workers=0. Please find the exact error output below.
My PyTorch build is compiled against CUDA 10.2, but I am running my code on a machine with CUDA 11.4. Could this be the source of the error?

Pytorch-lightning version = 1.4.2, torch version = '1.9.0+cu102'.
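For context, here is a minimal sketch of how my dataloaders and trainer are set up (illustrative only; TinyModel and TinyDataModule stand in for my actual model and data, which I cannot share):

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self.net(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


class TinyDataModule(pl.LightningDataModule):
    def __init__(self, val_workers):
        super().__init__()
        self.val_workers = val_workers
        self.train_ds = TensorDataset(torch.randn(640, 32), torch.randint(0, 10, (640,)))
        self.val_ds = TensorDataset(torch.randn(64, 32), torch.randint(0, 10, (64,)))

    def train_dataloader(self):
        # num_workers=4 for training has never been a problem
        return DataLoader(self.train_ds, batch_size=64, num_workers=4)

    def val_dataloader(self):
        # validation crashes in epoch 0 whenever val_workers > 0; val_workers=0 is fine
        return DataLoader(self.val_ds, batch_size=64, num_workers=self.val_workers)


if __name__ == "__main__":
    pl.seed_everything(42)
    trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=2)
    trainer.fit(TinyModel(), TinyDataModule(val_workers=4))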

Validation sanity check: 0it [00:00, ?it/s]/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:105: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Validation sanity check:   0%|          | 0/1 [00:00<?, ?it/s]
/home/usr/pytorch/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Global seed set to 42
Global seed set to 42
Epoch 0:  80%|████████████████████████████████████████████████████████▌              | 4/5 [00:14<00:02,  2.80s/it, loss=4.33, v_num=d09e]
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2b5f7135ca22 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10d7e (0x2b5f710ecd7e in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x2b5f710ee027 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x2b5f713465a4 in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x2b5f1a569e1a in /home/usr/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

(The second DDP process aborts with an identical c10::CUDAError block.)
Traceback (most recent call last):
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3404) is killed by signal: Aborted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/usr/mymodel/run.py", line 22, in <module>
    main()
  File "/home/usr/mymodel/run.py", line 18, in main
    return train(CFG)
  File "/scratch/usr/mymodel/src/train.py", line 110, in train
    trainer.fit(model,dm)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 112, in run
    self.on_advance_end()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 177, in on_advance_end
    self._run_validation()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 256, in _run_validation
    self.val_loop.run()
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/usr/pytorch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 93, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/home/usr/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3404) exited unexpectedly

(The second DDP process prints the same pair of tracebacks, ending in: RuntimeError: DataLoader worker (pid(s) 3407) exited unexpectedly)

@ptrblck Would really appreciate it if you could take a look! Thank you!

cc @VitalyFedyunin for data loader questions


Hi! Can you please try a PyTorch nightly build? We made some changes to the CUDA IPC code recently, and I wonder if you will get the same error (please paste it here if you do).


@VitalyFedyunin Hi! Thanks for your reply. I noticed that with my current versions I also get the error immediately when I set a PyTorch seed, or after a few epochs when no seed is set, as long as the validation num_workers > 0. I also get the error when num_workers > 0 and I read from a pickled file containing PyTorch tensors (it errors even if I save them as numpy arrays). The error also sometimes occurs when I write a pickled file (I write only from the local_rank=0 process, but read the pickled file from both GPU processes in DDP).

The first two errors are resolved with the nightly build. However, the third one, where I read PyTorch tensors from a pickled file with validation num_workers > 0, still persists. With validation num_workers=0 there is no error. Any idea on how to resolve this would be great! Please find a few more details below.

I have a small dataset, so I load all the data in the __init__ of my Dataset class. I then save it to disk with pickle so I can skip the slow data loading the next time I run my code. Since I have 2 GPUs, DDP in pytorch-lightning starts 2 processes, and each of these processes reads from the pickle file. Both the training data and the validation data are read from pickle files. Epoch 0 training completes successfully, but as soon as validation starts, it crashes with the error below.
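To make the setup concrete, here is roughly what my Dataset does (a simplified sketch; the cache path and the _load_raw helper are made-up names, not my real code):

import os
import pickle

import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    def __init__(self, cache_path="train_cache.pkl", local_rank=0):
        if local_rank == 0 and not os.path.exists(cache_path):
            samples = self._load_raw()          # slow one-time loading of the raw data
            with open(cache_path, "wb") as f:
                pickle.dump(samples, f)         # written only by the local_rank == 0 process
        with open(cache_path, "rb") as f:
            self.samples = pickle.load(f)       # read by both DDP processes

    def _load_raw(self):
        # placeholder for the real loading logic
        return [(torch.randn(32), int(torch.randint(0, 10, (1,)))) for _ in range(100)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]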

The error I get is -

CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                                  
Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:1243 (most recent call first):                                                                              
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2b74fab78a52 in /home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10.so)                 
frame #1: <unknown function> + 0x1c8ce (0x2b74fa8fc8ce in /home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)                                          
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x2b74fa8fcee2 in /home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)            
frame #3: c10::TensorImpl::release_resources() + 0x9c (0x2b74fab6205c in /home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10.so)                                
frame #4: <unknown function> + 0x292b79 (0x2b749d542b79 in /home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                     
frame #5: <unknown function> + 0xacd961 (0x2b749dd7d961 in /home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                     
<omitting python frames>                                                                                                                                                                
                                                                                                                                                                                        
^C/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1047: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...        
  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")                                                                                                         
^CException ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x2b74fc2c85e0>                                                                                             
Traceback (most recent call last):                                                                                                                                                      
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__                                                               
    self._shutdown_workers()                                                                                                                                                            
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1301, in _shutdown_workers                                                     
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)                                                                                                                                     
  File "/cvmfs/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/process.py", line 149, in join                                       
    res = self._popen.wait(timeout)                                                                                                                                                     
  File "/cvmfs/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait                                     
    if not wait([self.sentinel], timeout):                                                                                                                                              
  File "/cvmfs/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 931, in wait                                    
    ready = selector.select(timeout)                                                                                                                                                    
  File "/cvmfs/server/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/selectors.py", line 415, in select                                                   
    fd_event_list = self._selector.poll(timeout)                                                                                                                                        
KeyboardInterrupt:                            
^CTraceback (most recent call last):                                                                                                                                                    
  File "/home/usr/mydir/mymodel/run.py", line 22, in <module>            
    main()                                                                                                                                                                              
  File "/home/usr/mydir/mymodel/run.py", line 18, in main                
    return train(CFG)                                                                                                                                                                   
  File "/mydir/usr/mymodel/mymodelsub/train.py", line 110, in train                                                                                         
    trainer.fit(model,dm)
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)                                                                                                                                                                    
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 921, in _run
    self._post_dispatch()
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 976, in _post_dispatch
    self.accelerator.teardown()
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 57, in teardown
    super().teardown()
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 157, in teardown
    self.training_type_plugin.teardown()
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/parallel.py", line 143, in teardown
    self.lightning_module.cpu()
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 136, in cpu
    return super().cpu()
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/nn/modules/module.py", line 533, in _apply
    module._apply(fn)
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/nn/modules/module.py", line 533, in _apply
    module._apply(fn)
  File "/home/usr/pytorch_nightly/lib/python3.8/site-packages/torch/nn/modules/module.py", line 533, in _apply
    module._apply(fn)

Hi.
I have the same error message; however, it is transient for me, i.e., it happens randomly either during training or validation. I have only seen the issue occur with DDP (multiple GPUs); single-GPU runs without DDP work fine.
I am also using multiple dataloader workers for training and validation.

I am using torch==1.9.0+cu111 and Pytorch-lightning==1.4.2.

Here is the start of the error:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f73757b9a22 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)                                     
...

Hi @tinan5. Oh, that's interesting. Are you by any chance reading/writing data from/to a pickle or HDF5 file? For me this was mostly the issue: in my __init__ method I was previously writing data to a pickle file, and when I removed that, my code works fine. I guess it's an issue with reading and writing files during multiprocessing. I found this issue quite similar: Data Loader does not work with Hdf5 file, when num_worker >1 · Issue #11929 · pytorch/pytorch · GitHub. Please let me know if you get any more insights.
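In case it helps, this is roughly the shape of the workaround I ended up with: do the one-time pickle write in a separate step before trainer.fit() spawns the DDP processes, and store plain CPU/numpy data, so the Dataset __init__ only ever reads. The names here (build_cache, CACHE_PATH, load_raw_data) are illustrative, not my actual code:

import os
import pickle

import torch
from torch.utils.data import Dataset

CACHE_PATH = "data_cache.pkl"


def load_raw_data():
    # placeholder for the real (slow) one-time loading logic
    return [torch.randn(32) for _ in range(100)]


def build_cache():
    # Run this once in the main process, before trainer.fit() spawns the DDP
    # processes, so no dataloader worker ever writes to the file.
    if not os.path.exists(CACHE_PATH):
        samples = [t.cpu().numpy() for t in load_raw_data()]  # store plain numpy arrays, not CUDA tensors
        with open(CACHE_PATH, "wb") as f:
            pickle.dump(samples, f)


class CachedDataset(Dataset):
    def __init__(self):
        # __init__ now only reads; every DDP process (and its workers) can do this safely
        with open(CACHE_PATH, "rb") as f:
            self.samples = [torch.from_numpy(a) for a in pickle.load(f)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

This keeps the caching benefit while making sure no file writes happen inside the dataloader worker processes.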

Also, you could try the nightly version; it fixed most of the other issues that were giving me the same error.

Hi @Rohit_R. I am reading from disk, but not writing, and not in HDF5 format. Perhaps the nightly version can fix the DDP issues. I will try it.

@VitalyFedyunin @tinan5 I got the same error after a few epochs with the nightly version as well… It does not seem to be fixed.

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:1243 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2ab1c8b51a52 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c8ce (0x2ab1c88d58ce in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x2ab1c88d5ee2 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x9c (0x2ab1c8b3b05c in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x292b79 (0x2ab16b51bb79 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xacd961 (0x2ab16bd56961 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: initialization error
Exception raised from insert_events at ../c10/cuda/CUDACachingAllocator.cpp:1243 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2ab1c8b51a52 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c8ce (0x2ab1c88d58ce in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x2ab1c88d5ee2 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x9c (0x2ab1c8b3b05c in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x292b79 (0x2ab16b51bb79 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xacd961 (0x2ab16bd56961 in /home/me/pytorch_nightly/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #36: <unknown function> + 0x91d1 (0x2ab15dae61d1 in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #37: <unknown function> + 0x11821 (0x2ab15daee821 in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #38: <unknown function> + 0x1052d (0x2ab15daed52d in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #39: <unknown function> + 0x12263 (0x2ab15daef263 in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #40: <unknown function> + 0x113d1 (0x2ab15daee3d1 in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #41: <unknown function> + 0x10c62 (0x2ab15daedc62 in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #42: <unknown function> + 0x1052d (0x2ab15daed52d in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #43: <unknown function> + 0x15090 (0x2ab15daf2090 in /Core/python/3.8.10/lib/python3.8/lib-dynload/_pickle.cpython-38-x86_64-linux-gnu.so)
frame #61: <unknown function> + 0x7f27 (0x2ab15c357f27 in /cvmfs/server/gentoo/2020/lib64/libpthread.so.0)
frame #62: clone + 0x3f (0x2ab15c47087f in /cvmfs/server/gentoo/2020/lib64/libc.so.6)

Traceback (most recent call last):
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/Core/python/3.8.10/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/Core/python/3.8.10/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/Core/python/3.8.10/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 20907) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 112, in run
    self.on_advance_end()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 177, in on_advance_end
    self._run_validation()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 256, in _run_validation
    self.val_loop.run()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 93, in advance
    batch_idx, batch = next(dataloader_iter)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 20907) exited unexpectedly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/me/scratch/myproj/run.py", line 20, in <module>
    main()
  File "/home/me/scratch/myproj/run.py", line 16, in main
    return train(CFG)
  File "/scratch/me/myproj/myprojsub/train.py", line 111, in train
    trainer.fit(model,dm)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run_train
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/me/pytorch_nightly/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 451, in reconciliate_processes
    os.kill(pid, signal.SIGKILL)
ProcessLookupError: [Errno 3] No such process

@VitalyFedyunin @tinan5 It's most likely a PL bug. Check this out: PyTorch Lightning 1.4.1 crashes during training · Issue #8821 · PyTorchLightning/pytorch-lightning · GitHub.