Hello. My deep learning program runs on an Ubuntu server with multiple graphics cards. Instead of restricting the visible cards with CUDA_VISIBLE_DEVICES, I specify a device index directly when creating tensors in the program. As a result, although the program mainly uses the specified card, it also occupies a small amount of memory on each of the remaining cards.

Today, while the program was running, one of the graphics cards failed for some reason. It is not the card my tensors are stored on, but afterwards the program could not be shut down, and the memory on the card I was actually using could not be released. I tried the kill command to end the process, but it had no effect. Is there a solution other than rebooting?
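For reference, this is roughly the pattern I mean; it is a minimal PyTorch sketch, not my actual code, and the index 2 is just a placeholder. My understanding is that masking with CUDA_VISIBLE_DEVICES before any CUDA call hides the other cards from the process entirely, whereas pinning an index in code leaves them all visible, which is presumably how the stray allocations on the other cards happen:

```python
import os

# Masking approach: set this before importing torch (or at least before the
# first CUDA call), so the process can only ever see the chosen card.
# The visible card is then addressed inside the program as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # placeholder physical index

import torch

x = torch.randn(1024, 1024, device="cuda:0")  # lands on physical GPU 2

# Pinning approach (what I currently do): every card stays visible to the
# process, so stray CUDA contexts can end up on cards I never compute on.
# y = torch.randn(1024, 1024, device="cuda:2")
```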
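And to make "kill is useless" concrete, what I tried is equivalent to the sketch below (the PID is hypothetical, taken from nvidia-smi). As far as I know, SIGKILL cannot be caught or ignored, so if the process survives it, it is most likely sitting in uninterruptible sleep (state D) inside the GPU driver after the card failure, which would also explain why the memory is never freed:

```python
import os
import signal

pid = 12345  # hypothetical: the stuck process's PID as reported by nvidia-smi

# Equivalent of `kill -9 12345`.
os.kill(pid, signal.SIGKILL)

# SIGKILL is delivered but only acted on once the process leaves the kernel;
# a process blocked inside a hung GPU driver typically shows state "D"
# (uninterruptible sleep) and survives even SIGKILL until the driver returns.
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)  # "D" means no signal, SIGKILL included, can take effect yet
```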
I don't fully understand why a reboot would be problematic based on your description, which sounds as if you want to shut down the workstation in any case to replace, or at least remove, the broken device?
Because the other graphics cards can still be used, and other people are still running their programs on the server.
Because someone else is using the other graphics cards. Rebooting is the last resort.