CUDA error: Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:763

I’ve been running the same code for several weeks without any problems, but just the other day I started getting this error. Training proceeds just fine for several thousand iterations, and then a CUDA error is raised. It seems to happen randomly. I have no idea what the problem could be. Here is the error message:

terminate called after throwing an instance of 'c10::Error'                                                                                                   
  what():  CUDA error: initialization error                                                                                                                   
Exception raised from insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:763 (most recent call first):                                               
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fa58ba762f2 in /home/catalys1/venv/lib/python3.9/site-packages/torch/lib/libc10.so)  
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fa58ba7367b in /home/catalys1/venv/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc92 (0x7fa58bcce682 in /home/catalys1/venv/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fa58ba5e3a4 in /home/catalys1/venv/lib/python3.9/site-packages/torch/lib/libc10.so)                 
frame #4: <unknown function> + 0x6e415a (0x7fa5de63915a in /home/catalys1/venv/lib/python3.9/site-packages/torch/lib/libtorch_python.so)                      
frame #5: <unknown function> + 0x233ea2 (0x55f73a3bcea2 in /home/catalys1/venv/bin/python)                                                                    
frame #6: <unknown function> + 0x23383e (0x55f73a3bc83e in /home/catalys1/venv/bin/python)                                                                    
frame #7: _PyObject_GC_New + 0xaa (0x55f73a3407ca in /home/catalys1/venv/bin/python)                                                                          
frame #8: PyMethod_New + 0x25 (0x55f73a35cb75 in /home/catalys1/venv/bin/python)                                                                              
frame #9: <unknown function> + 0x160423 (0x55f73a2e9423 in /home/catalys1/venv/bin/python)                                                                    
frame #10: _PyObject_GetMethod + 0x10b (0x55f73a2d79cb in /home/catalys1/venv/bin/python)                                                                     
frame #11: _PyEval_EvalFrameDefault + 0x541 (0x55f73a313bc1 in /home/catalys1/venv/bin/python)                                                                
frame #12: <unknown function> + 0x189ebf (0x55f73a312ebf in /home/catalys1/venv/bin/python)                                                                   
frame #13: _PyObject_Call_Prepend + 0x46f (0x55f73a2abc7f in /home/catalys1/venv/bin/python)                                                                  
frame #14: <unknown function> + 0x160aba (0x55f73a2e9aba in /home/catalys1/venv/bin/python)
frame #15: <unknown function> + 0x15db61 (0x55f73a2e6b61 in /home/catalys1/venv/bin/python)
frame #16: <unknown function> + 0x1d7785 (0x55f73a360785 in /home/catalys1/venv/bin/python)    
frame #17: PyObject_Call + 0x22c (0x55f73a2ac40c in /home/catalys1/venv/bin/python)        
frame #18: _PyEval_EvalFrameDefault + 0x2f9a (0x55f73a31661a in /home/catalys1/venv/bin/python)
frame #19: <unknown function> + 0x189ac1 (0x55f73a312ac1 in /home/catalys1/venv/bin/python)
frame #20: _PyObject_Call_Prepend + 0x46f (0x55f73a2abc7f in /home/catalys1/venv/bin/python)
frame #21: <unknown function> + 0x209eb9 (0x55f73a392eb9 in /home/catalys1/venv/bin/python)    
frame #22: _PyObject_MakeTpCall + 0x7e (0x55f73a2aa76e in /home/catalys1/venv/bin/python)  
frame #23: _PyEval_EvalFrameDefault + 0x51d3 (0x55f73a318853 in /home/catalys1/venv/bin/python)
frame #24: <unknown function> + 0x18a068 (0x55f73a313068 in /home/catalys1/venv/bin/python)   
frame #25: _PyFunction_Vectorcall + 0x19d (0x55f73a2ab0ed in /home/catalys1/venv/bin/python)  
frame #26: _PyEval_EvalFrameDefault + 0x3e9 (0x55f73a313a69 in /home/catalys1/venv/bin/python)
frame #27: _PyEval_EvalCodeWithName + 0x252 (0x55f73a3120a2 in /home/catalys1/venv/bin/python)
frame #28: PyEval_EvalCode + 0x27 (0x55f73a3a3147 in /home/catalys1/venv/bin/python)       
frame #29: <unknown function> + 0x26fd82 (0x55f73a3f8d82 in /home/catalys1/venv/bin/python)
frame #30: <unknown function> + 0x1d9a03 (0x55f73a362a03 in /home/catalys1/venv/bin/python)    
frame #31: PyObject_Call + 0x1d2 (0x55f73a2ac3b2 in /home/catalys1/venv/bin/python)        
frame #32: _PyEval_EvalFrameDefault + 0x5c8e (0x55f73a31930e in /home/catalys1/venv/bin/python)
frame #33: <unknown function> + 0x189ac1 (0x55f73a312ac1 in /home/catalys1/venv/bin/python)    
frame #34: _PyFunction_Vectorcall + 0x19d (0x55f73a2ab0ed in /home/catalys1/venv/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x4c3d (0x55f73a3182bd in /home/catalys1/venv/bin/python)
frame #36: _PyFunction_Vectorcall + 0x103 (0x55f73a2ab053 in /home/catalys1/venv/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x663 (0x55f73a313ce3 in /home/catalys1/venv/bin/python)
frame #38: _PyFunction_Vectorcall + 0x103 (0x55f73a2ab053 in /home/catalys1/venv/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x3e9 (0x55f73a313a69 in /home/catalys1/venv/bin/python)
frame #40: _PyFunction_Vectorcall + 0x103 (0x55f73a2ab053 in /home/catalys1/venv/bin/python)                                                                  
frame #41: _PyEval_EvalFrameDefault + 0x3e9 (0x55f73a313a69 in /home/catalys1/venv/bin/python)                                                                
frame #42: _PyFunction_Vectorcall + 0x103 (0x55f73a2ab053 in /home/catalys1/venv/bin/python)       
frame #43: <unknown function> + 0x121bed (0x55f73a2aabed in /home/catalys1/venv/bin/python)           
frame #44: _PyObject_CallMethodIdObjArgs + 0x135 (0x55f73a2ac885 in /home/catalys1/venv/bin/python)
frame #45: PyImport_ImportModuleLevelObject + 0x3da (0x55f73a32bc7a in /home/catalys1/venv/bin/python)
frame #46: <unknown function> + 0x1de1fc (0x55f73a3671fc in /home/catalys1/venv/bin/python)
frame #47: <unknown function> + 0x1d9b1b (0x55f73a362b1b in /home/catalys1/venv/bin/python)
frame #48: PyObject_Call + 0x22c (0x55f73a2ac40c in /home/catalys1/venv/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x5c8e (0x55f73a31930e in /home/catalys1/venv/bin/python)
frame #50: <unknown function> + 0x189ac1 (0x55f73a312ac1 in /home/catalys1/venv/bin/python)
frame #51: _PyFunction_Vectorcall + 0x19d (0x55f73a2ab0ed in /home/catalys1/venv/bin/python)
frame #52: _PyEval_EvalFrameDefault + 0x3e9 (0x55f73a313a69 in /home/catalys1/venv/bin/python)
frame #53: <unknown function> + 0x18a068 (0x55f73a313068 in /home/catalys1/venv/bin/python)
frame #54: _PyFunction_Vectorcall + 0x19d (0x55f73a2ab0ed in /home/catalys1/venv/bin/python)
frame #55: <unknown function> + 0x121bed (0x55f73a2aabed in /home/catalys1/venv/bin/python)
frame #56: _PyObject_CallMethodIdObjArgs + 0x135 (0x55f73a2ac885 in /home/catalys1/venv/bin/python)
frame #57: PyImport_ImportModuleLevelObject + 0x46e (0x55f73a32bd0e in /home/catalys1/venv/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x3294 (0x55f73a316914 in /home/catalys1/venv/bin/python)
frame #59: _PyFunction_Vectorcall + 0x103 (0x55f73a2ab053 in /home/catalys1/venv/bin/python)
frame #60: _PyEval_EvalFrameDefault + 0x663 (0x55f73a313ce3 in /home/catalys1/venv/bin/python)
frame #61: _PyFunction_Vectorcall + 0x103 (0x55f73a2ab053 in /home/catalys1/venv/bin/python)
frame #62: _PyEval_EvalFrameDefault + 0x663 (0x55f73a313ce3 in /home/catalys1/venv/bin/python)
frame #63: <unknown function> + 0x18a068 (0x55f73a313068 in /home/catalys1/venv/bin/python)

Does anyone have any idea what could be causing this issue, and how I can fix it? Any help would be much appreciated. Thank you!

An initialization error can be raised, for example, if you are trying to create multiple CUDA contexts.
In your case it looks as if a delete call is running into it. Could you check dmesg for Xid error messages to see whether, for example, the driver crashed?

So I’m still running into this. For additional context, I’m using PyTorch Lightning with distributed data parallel on 6 GPUs. Training proceeds normally for a few epochs and then crashes with the error given above. I also just verified that the same thing happens when using a single GPU.

@ptrblck I’m not sure what I’m looking for in dmesg. dmesg | grep xid comes up empty.
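One thing worth ruling out here: grep is case-sensitive, and the NVIDIA driver writes these messages to the kernel log as "NVRM: Xid" with a capital X, so a lowercase `grep xid` would miss them (`dmesg | grep -i xid` would not). Below is a minimal Python sketch of the same case-insensitive check. The function names `find_xid_lines` and `read_kernel_log` are made up for illustration, and reading dmesg may require elevated privileges on some systems.

```python
import re
import subprocess

# Case-insensitive match, because the driver logs "Xid" with a capital X.
XID_PATTERN = re.compile(r"\bxid\b", re.IGNORECASE)


def find_xid_lines(log_text):
    """Return every line of `log_text` that mentions an Xid event."""
    return [line for line in log_text.splitlines() if XID_PATTERN.search(line)]


def read_kernel_log():
    """Return the output of `dmesg` as text (hypothetical helper; may need root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Calling `find_xid_lines(read_kernel_log())` right after a crash returns the matching lines, if any. An empty list would suggest the driver did not report an Xid event, which points away from a driver or hardware fault.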

Try to narrow down the issue on a single device by checking whether you can reproduce it deterministically. If so, check whether any additional CUDA-related work is done in the failing iteration, e.g. using CUDA tensors in the Dataset. Also, scale down the use case as much as possible, e.g. by setting the number of DataLoader workers to 0.
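The check for CUDA tensors in the Dataset could be sketched roughly as follows. This is only an illustration, not part of PyTorch: `find_cuda_items` is a hypothetical helper that walks a sample made of nested dicts, lists, and tuples and relies on the `is_cuda` attribute that `torch.Tensor` exposes. It is duck-typed, so it does not import torch itself.

```python
def find_cuda_items(sample, path="sample"):
    """Return the paths of all items in `sample` whose `is_cuda` attribute is True."""
    hits = []
    if getattr(sample, "is_cuda", False):
        # torch.Tensor exposes `is_cuda`; plain Python containers do not.
        hits.append(path)
    elif isinstance(sample, dict):
        for key, value in sample.items():
            hits.extend(find_cuda_items(value, f"{path}[{key!r}]"))
    elif isinstance(sample, (list, tuple)):
        for index, value in enumerate(sample):
            hits.extend(find_cuda_items(value, f"{path}[{index}]"))
    return hits
```

Running this over a few samples, e.g. `find_cuda_items(dataset[0])`, should return an empty list if the Dataset only hands out CPU data. Any path it reports marks a tensor that was created on the GPU inside the Dataset, which is worth moving to the training loop before testing again with `num_workers=0`.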

I decided to try updating PyTorch to 1.9 (from 1.8), and that seems to have fixed it…

I now have this exact same problem, but I’m running pytorch 1.10.0+cu102. It always occurs midway through epoch 2 of CIFAR10. Any other ideas?

Edit: The problem comes and goes. It just disappeared with one run and returned 2 runs later…

I opened a bug report: "Changing errors when running the same code", Issue #11067 in PyTorchLightning/pytorch-lightning on GitHub.

Every time I tried running it (even without a GPU), I’d get a different error.