Hello. My deep learning program runs on an Ubuntu server with multiple graphics cards. Instead of restricting the visible cards with CUDA_VISIBLE_DEVICES, I specify a device index directly when creating tensors in the program. As a result, although the program mainly uses the specified card, it also occupies a small amount of memory on each of the remaining cards.

Today, while the program was running, one of the graphics cards failed for some reason. It is not the card my tensors are stored on, but afterwards the program could not be shut down, and the memory on the card I was actually using could not be released. I tried the kill command to end the process, but it had no effect. Is there a solution other than rebooting?
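For reference, this is roughly the pattern I mean; it is a minimal PyTorch sketch, not my actual code, and the index 2 is just a placeholder. My understanding is that masking with CUDA_VISIBLE_DEVICES before any CUDA call hides the other cards from the process entirely, whereas pinning an index in code leaves them all visible, which is presumably how the stray allocations on the other cards happen:

```python
import os

# Masking approach: set this before importing torch (or at least before the
# first CUDA call), so the process can only ever see the chosen card.
# The visible card is then addressed inside the program as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # placeholder physical index

import torch

x = torch.randn(1024, 1024, device="cuda:0")  # lands on physical GPU 2

# Pinning approach (what I currently do): every card stays visible to the
# process, so stray CUDA contexts can end up on cards I never compute on.
# y = torch.randn(1024, 1024, device="cuda:2")
```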
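And to make "kill is useless" concrete, what I tried is equivalent to the sketch below (the PID is hypothetical, taken from nvidia-smi). As far as I know, SIGKILL cannot be caught or ignored, so if the process survives it, it is most likely sitting in uninterruptible sleep (state D) inside the GPU driver after the card failure, which would also explain why the memory is never freed:

```python
import os
import signal

pid = 12345  # hypothetical: the stuck process's PID as reported by nvidia-smi

# Equivalent of `kill -9 12345`.
os.kill(pid, signal.SIGKILL)

# SIGKILL is delivered but only acted on once the process leaves the kernel;
# a process blocked inside a hung GPU driver typically shows state "D"
# (uninterruptible sleep) and survives even SIGKILL until the driver returns.
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)  # "D" means no signal, SIGKILL included, can take effect yet
```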
I don't fully understand why a reboot would be problematic based on your description, which sounds as if you want to shut down the workstation in any case to replace, or at least remove, the broken device?
Because the other graphics cards can still be used, and other people are still running their programs on the server.
Because someone else is using the other graphics cards. Rebooting is the last resort.