DataLoader randomly hanging

Hi,

My program started to hang randomly. I prepended the variable CUDA_LAUNCH_BLOCK=1, but it doesn’t show me any error line related to my code. Running with num_workers=0 takes ~7.25s per iteration (~2 hours per epoch) :disappointed:

Any tips for debugging this? Patience?

what():  CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629411241/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f83d65fc77d in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7f83d684d370 in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f83d65e8b1d in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53956b (0x7f840fe7956b in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

RuntimeError('DataLoader worker (pid(s) 13152) exited unexpectedly')
> /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/utils/data/dataloader.py(792)_try_get_data()
    791                 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 792                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    793             if isinstance(e, queue.Empty):

Cheers,
Victor

Could you post the output of python -m torch.utils.collect_env and update to the latest release, if not already done?

Thanks for following up. I’m sorry for the late reply. This is the info about my environment.

I’m in the process of creating a new environment with the latest PyTorch.

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.5 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.1.168
GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 460.80
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] pytorch-lightning==0.9.0
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas                      2.21                        mkl    conda-forge
[conda] cudatoolkit               10.1.243             h036e899_6    conda-forge
[conda] libblas                   3.8.0                    21_mkl    conda-forge
[conda] libcblas                  3.8.0                    21_mkl    conda-forge
[conda] liblapack                 3.8.0                    21_mkl    conda-forge
[conda] liblapacke                3.8.0                    21_mkl    conda-forge
[conda] mkl                       2020.4             h726a3e6_304    conda-forge
[conda] numpy                     1.19.1           py38hbc27379_2    conda-forge
[conda] pytorch                   1.6.0           py3.8_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.9.0                      py_0    conda-forge
[conda] torchvision               0.7.0                py38_cu101    pytorch

Thanks for the update! Could you update PyTorch to the latest or the nightly release and check if you still see the same issues?

Thanks!

OK, after trying multiple setups, I managed to find a combination using pytorch=1.9 without conflicts. I will report tomorrow.

BTW, is it fine that the cuDNN version does not match the one used for the PyTorch binaries?

cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
[conda] pytorch                   1.6.0           py3.8_cuda10.1.243_cudnn7.6.3_0    pytorch

I did not have cudnn in the conda environment. I added that, and I will test a new environment without that discrepancy.

Your local CUDA toolkit and cuDNN won’t be used by the binaries, as they ship with their own libs, unless you build PyTorch from source or a custom CUDA extension.
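
For example, to check which CUDA and cuDNN versions the installed binaries actually use, you can run these standard calls inside the environment (the printed values will of course depend on your install):

import torch

print(torch.__version__)                # installed PyTorch version
print(torch.version.cuda)               # CUDA version the binaries were built with
print(torch.backends.cudnn.version())   # cuDNN version used by the binaries
print(torch.cuda.is_available())        # sanity check that the driver is visible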

Gotcha.

Another observation: the original environment has now been running successfully for 2 days when I use num_workers=0. I will launch a couple more trials on other machines where I’m getting the same issue.

  1. After updating to PyTorch 1.9.0, the code keeps breaking. The error message made me realize that I was setting the wrong variable, CUDA_LAUNCH_BLOCK. With CUDA_LAUNCH_BLOCKING=1, the code hangs completely. It did not progress while I was sleeping :sweat_smile:

    Does anything come to mind?

Collecting environment information...                                       
PyTorch version: 1.9.0                                           
Is debug build: False                                            
CUDA used to build PyTorch: 11.1                                 
ROCM used to build PyTorch: N/A                                  
                                                                 
OS: Ubuntu 18.04.5 LTS (x86_64)                                  
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0                                        
Clang version: Could not collect                                         
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.8 (64-bit runtime)
Python platform: Linux-4.15.0-147-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 10.1.168
GPU models and configuration: 
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti

Nvidia driver version: 460.80
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] pytorch-lightning==0.9.0
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.1           py38hbc911f0_0  
[conda] numpy-base                1.19.1           py38hfa32c7d_0  
[conda] pytorch                   1.9.0           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] pytorch-lightning         0.9.0                    pypi_0    pypi
[conda] torchvision               0.10.0               py38_cu111    pytorch
  2. Side question: if I install with pip, which dependencies are linked from the system?
    The code environment that I’m using mixes conda and pip requirements, and conda itself recommends being careful about that, AFAIR. Thus, I’m considering just relying on pip and using conda only as an environment wrapper.
  1. In that case, could you post a minimal, executable code snippet to reproduce this issue?
  2. I don’t know exactly which libraries are loaded via e.g. dlopen, but the CUDA libs will be linked either statically in the pip wheels or dynamically, as e.g. the cudatoolkit in the conda binaries (a rough way to inspect the bundled libs is sketched below).
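
As a rough sanity check (not an official API, just directory inspection), you can also list the libs that ship inside the installed torch package to see what is bundled versus what would have to come from the system:

import os
import torch

# torch.__file__ points into the installed package; the bundled shared
# libraries (libtorch, and for some builds e.g. libcudart/libcudnn) live in lib/.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
print(lib_dir)
print(sorted(os.listdir(lib_dir)))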

Thanks for following up. I’m still debugging, but it seems that the error is related to the extra machinery that PyTorch Lightning (PL) wraps around dataloaders.

  • Setting reload_dataloaders_every_epoch=True on PL.Trainer did the trick for some of my models (or machines? :thinking: ). The sysadmin hasn’t come back with the details of the provisioning used on the machines that run fine with that flag. On my machine, that hack does not work.
  • So far, num_workers=0 is a more consistent hack, but impractical.
    Perhaps it’s an issue related to shared memory, I don’t know. The code uses an ancient version of PL.

I feel that it makes more sense to upgrade the project to the latest versions of PyTorch and PL. I will post an example if it keeps breaking.

After a painful debugging session, the preliminary takeaways:

  • Patience is crucial.
  • Where possible, try multiple things in parallel. Namely, let the training loop run with num_workers=0 and other values for a long period of time. If the error disappears with num_workers=0, the issue might not be in the logic of your custom dataset.
  • Cross-check the versions of your packages so that you know where you stand.
  • Try CUDA_LAUNCH_BLOCKING=1 to rule out a silly error (and double-check that you spell the variable correctly).
  • To verify that the error is not in your custom dataset: record the instances given to the dataset, then poke your dataset manually. If no error is triggered, the issue could be in the parallelism logic (see the sketch after this list).
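
A minimal sketch of that last check, with a DummyDataset standing in for your custom dataset (the class, sizes, and num_workers below are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    """Placeholder for the custom dataset you want to rule out."""

    def __init__(self, num_samples=1000):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # A real dataset would load and decode data here.
        return torch.randn(3, 224, 224), idx


dataset = DummyDataset()

# 1) Poke the dataset directly, with no workers involved. If this loop
#    finishes cleanly, the dataset logic itself is probably fine.
for idx in range(len(dataset)):
    _ = dataset[idx]

# 2) Exercise the same dataset through a DataLoader with workers. If the
#    hang only shows up here, the problem is more likely in the worker /
#    shared-memory side than in the dataset.
loader = DataLoader(dataset, batch_size=32, num_workers=4)
for batch, indices in loader:
    pass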

Last report: I found a reliable hack for running experiments within a decent time frame (a minimal sketch follows the list) :slight_smile:

  • Set num_workers=0 in the validation loader
  • Set reload_dataloaders_every_epoch=True (PL argument)
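
A minimal sketch of that setup, assuming the dataloaders live on the LightningModule; the toy model and datasets are placeholders, and reload_dataloaders_every_epoch is the Trainer argument name used in this thread (newer PL releases renamed it to reload_dataloaders_every_n_epochs):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": nn.functional.mse_loss(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": nn.functional.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
        # Workers stay enabled for training to keep throughput.
        return DataLoader(ds, batch_size=32, num_workers=4)

    def val_dataloader(self):
        ds = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
        # Hack 1: no worker processes for the validation loader.
        return DataLoader(ds, batch_size=32, num_workers=0)


# Hack 2: recreate the dataloaders every epoch.
trainer = pl.Trainer(reload_dataloaders_every_epoch=True, max_epochs=2)
trainer.fit(LitModel())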

Cheers :beers:

Acknowledgements: @ptrblck. Brais & Lukasz for listening and coming up with suggestions. Fuwen for suggestions, spending some time googling the error, and coming up with something non-trivial :slight_smile: