CUDA error: unspecified launch failure when running the program for the second time

DawnChou · August 2, 2023, 6:16am

Hi,

I came across the problem

CUDA error: unspecified launch failure

when I tried to run the model a second time in a Python script. The model is in the function RunmyModel(), and I need to call that function twice in the script. However, when the program called the function the second time, it raised the CUDA error. Specifically, the traceback ended at

File ~/integration/testmodel.py:160, in testNet.to(self)
    158 for name, param in self.dict.items():
    159     if name not in ['k']:
--> 160         self.dict[name] = param.cuda()
RuntimeError: CUDA error: unspecified launch failure

It seems the error occurred in .to() when transferring the model to the GPU. Also, I noticed that the GPU memory was not freed after the first run.

I tried torch.cuda.empty_cache() but it didn't help. However, when I ran the same script on another machine with the same versions of torch and CUDA, the error did not occur, even though the GPU memory still was not freed after the first run. How can I solve this problem?

PS. I checked dmesg, here is the info:

[ 5.474440] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[4201635.872634] NVRM: GPU at PCI:0000:0b:00: GPU-cab56c2d-811d-98fb-d3de-52e2bb36782d

ptrblck · August 2, 2023, 11:29am

I'm unsure what the dmesg output should indicate, but are you able to use the GPU at all afterwards? If not, your system might be dropping it, which should be indicated by Xid messages in the dmesg output.
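
In case it helps, here is a rough sketch of both checks: scanning kernel log text for Xid reports and testing whether the GPU still responds. The helper names find_xids and gpu_is_usable are made up for illustration, and the regular expression assumes the usual "NVRM: Xid (PCI:...): <code>, ..." line format written by the NVIDIA driver. Treat it as a starting point rather than a definitive diagnostic.

```python
import re

# Assumed format of an NVIDIA driver Xid report in the kernel log.
# Illustrative line only (not taken from the log posted above):
#   NVRM: Xid (PCI:0000:0b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xids(dmesg_text):
    """Return (device, xid_code) pairs for every Xid report in the log text."""
    return [
        (match.group("device"), int(match.group("code")))
        for match in XID_PATTERN.finditer(dmesg_text)
    ]


def gpu_is_usable():
    """Try a tiny allocation and a synchronize; return False if either fails."""
    import torch  # imported lazily so the log parsing works without PyTorch

    if not torch.cuda.is_available():
        return False
    try:
        torch.zeros(1, device="cuda")
        torch.cuda.synchronize()  # surfaces asynchronous CUDA errors here
        return True
    except RuntimeError:
        return False
```

To use it, pass the text of the dmesg output (reading it may require root on some systems) to find_xids, and call gpu_is_usable() in a fresh Python process after the failure. An empty list from find_xids together with a working GPU would point away from the device being dropped.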