My program started to hang randomly. I prepended the variable CUDA_LAUNCH_BLOCK=1 to the command, but it doesn't show me any error line related to my code. Running with num_workers=0 takes ~7.25s per iteration (~2 hours per epoch).
Any tips for debugging this? Patience?
what(): CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629411241/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f83d65fc77d in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7f83d684d370 in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f83d65e8b1d in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53956b (0x7f840fe7956b in /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
RuntimeError('DataLoader worker (pid(s) 13152) exited unexpectedly')
> /home/user/install/bin/miniconda3/envs/epic-models-plus/lib/python3.8/site-packages/torch/utils/data/dataloader.py(792)_try_get_data()
791 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 792 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
793 if isinstance(e, queue.Empty)
Your local CUDA toolkit and cuDNN won’t be used in the binaries, as they ship with their own libs, unless you build PyTorch from source or a custom CUDA extension.
Another observation: the original environment has been running successfully for 2 days with num_workers=0. I will launch a couple more trials on other machines where I'm getting the same issue.
After updating to PyTorch 1.9.0, the code keeps breaking. The error message made me realize that I was setting the wrong variable, CUDA_LAUNCH_BLOCK. With CUDA_LAUNCH_BLOCKING=1, the code hangs completely; it did not progress at all overnight.
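For reference, this is roughly how I'm setting the flag now (a minimal sketch; train.py is a placeholder for my actual entry script):

# Sketch: CUDA_LAUNCH_BLOCKING has to be set before the first CUDA call,
# e.g. by prefixing the command in the shell:
#   CUDA_LAUNCH_BLOCKING=1 python train.py
# or at the very top of the entry script, before importing torch:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var so kernel launches become synchronous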
Does anything come to mind?
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.27
Python version: 3.8 (64-bit runtime)
Python platform: Linux-4.15.0-147-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 10.1.168
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti
GPU 4: GeForce RTX 2080 Ti
GPU 5: GeForce RTX 2080 Ti
GPU 6: GeForce RTX 2080 Ti
GPU 7: GeForce RTX 2080 Ti
Nvidia driver version: 460.80
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] pytorch-lightning==0.9.0
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.3.0 py38h54f3939_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.1 py38hbc911f0_0
[conda] numpy-base 1.19.1 py38hfa32c7d_0
[conda] pytorch 1.9.0 py3.8_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-lightning 0.9.0 pypi_0 pypi
[conda] torchvision 0.10.0 py38_cu111 pytorch
Side question: if I install with pip, which dependencies are linked from the system?
The code environment that I'm using mixes conda and pip requirements, and conda itself recommends being careful about that, AFAIR. Thus, I'm considering relying only on pip and using conda just as an environment container.
In that case, could you post a minimal, executable code snippet to reproduce this issue?
I don’t know exactly which libraries are loaded via e.g. dlopen, but the CUDA libs will be linked either statically in the pip wheels or dynamically as e.g. the cudatoolkit in the conda binaries.
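As a quick sanity check (a minimal sketch, not an exhaustive inspection), you can print the CUDA and cuDNN versions the installed binary was built with and compare them against your local toolkit:

import torch

# Versions the PyTorch binary ships with / was built against
print(torch.__version__)               # e.g. 1.9.0
print(torch.version.cuda)              # CUDA used to build PyTorch, e.g. 11.1
print(torch.backends.cudnn.version())  # bundled cuDNN, e.g. 8005
print(torch.cuda.is_available())       # whether the driver can actually be used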
Thanks for following up. I'm still debugging, but it seems that the error is related to the extra machinery that PyTorch Lightning (PL) adds around the dataloaders.
Setting reload_dataloaders_every_epoch=True in PL's Trainer fixed it for some of my models (or machines?). The sys-admin hasn't come back to me with the provisioning details of the machines that run fine with that flag. On my machine, that hack does not work.
So far, num_workers=0 is a more consistent workaround, but impractical.
Perhaps it's an issue related to shared memory, I don't know. The code uses an ancient version of PL.
I feel it makes more sense to upgrade the project to the latest versions of PyTorch and PL. I will post an example if it keeps breaking.
After some painful debugging, these are the preliminary takeaways:
Patience is crucial.
Where possible, try multiple things in parallel. Namely, let the training loop run with num_workers=0 and with other values, each for a long period of time. If the error disappears with num_workers=0, the issue might not be in the logic of your custom dataset.
Cross-check the versions of the packages so that you know where you stand.
Try CUDA_LAUNCH_BLOCKING=1 to rule out a silly error.
To check that the error is not in your custom dataset, record the instances given to the dataset, then poke the dataset manually. If no error is triggered, the issue could be in the parallelism logic (see the sketch after this list).
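A minimal sketch of what I mean by poking the dataset manually (build_dataset and the recorded indices are placeholders for your own code):

# Sketch: replay the recorded sample indices through the dataset in the main
# process (no workers), so any exception points to the dataset logic itself.
def debug_dataset(dataset, recorded_indices):
    for idx in recorded_indices:
        sample = dataset[idx]  # raises here if __getitem__ is the culprit
        print(idx, type(sample))

# debug_dataset(build_dataset(), recorded_indices=[0, 17, 42])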
Last report, as I found a reliable hack for running experiments within a decent time frame; a sketch of the combination follows the list:
Set num_workers=0 in validation loader
Set reload_dataloaders_every_epoch=True (PL argument)
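Roughly, the combination looks like this (a sketch only; the model, datasets, and batch size are placeholders, and the flag name follows the older PL API used in this project):

from torch.utils.data import DataLoader
import pytorch_lightning as pl

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=8)
val_loader = DataLoader(val_dataset, batch_size=32, num_workers=0)  # hack 1: no workers for validation

trainer = pl.Trainer(
    gpus=1,
    reload_dataloaders_every_epoch=True,  # hack 2: rebuild the dataloaders each epoch
)
trainer.fit(model, train_loader, val_loader)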
Cheers
Acknowledgements: @ptrblck. Brais & Lukasz for listening and coming up with suggestions. Fuwen for suggestions, for spending some time googling the error, and for coming up with something non-trivial.