Something weird happened: my code gets killed after a number of iterations.
The only error message is “killed”.
I understand that there are many discussions about this; however, their solutions don't apply to my case.
- It is not CUDA OOM.
- I use
cpu = psutil.cpu_percent()
mem = psutil.virtual_memory().available * 100 / psutil.virtual_memory().total
to check my memory usage; however, the percentage of available memory always stays at 60% until the process is killed.
- This machine is shared by multiple people whose code and conda environments are definitely different (and our code worked before), yet we all get the same error.
- Apart from Python processes being killed, the machine runs smoothly without problems.
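As a cross-check on the system-wide figure, here is a minimal sketch of how the process's own resident memory could be logged next to the available percentage. The helper names `memory_snapshot` and `format_snapshot` are just illustrative, and the idea of calling them every few iterations is an assumption about the training loop, not something from my actual code.

```python
import os


def memory_snapshot():
    """Return (resident memory of this process in GiB, available system memory in %)."""
    # Imported here so the formatting helper below has no third-party dependency.
    import psutil

    proc = psutil.Process(os.getpid())
    rss_gib = proc.memory_info().rss / 1024**3
    vm = psutil.virtual_memory()
    avail_pct = vm.available * 100 / vm.total
    return rss_gib, avail_pct


def format_snapshot(step, rss_gib, avail_pct):
    """Build one log line for a given iteration (step is a hypothetical loop counter)."""
    return f"step {step}: rss={rss_gib:.2f} GiB, available={avail_pct:.1f}%"
```

Inside the loop this could be used as `print(format_snapshot(step, *memory_snapshot()))`, which would show whether the process itself keeps growing even when the system-wide percentage looks flat.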
The error log from dmesg is “Out of memory: Killed process 54477 (python) total-vm:61672084kB, anon-rss:32929660kB, file-rss:66904kB, shmem-rss:16228kB, UID:1011 pgtables:72632kB oom_score_adj:0”
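To make the numbers in that log line easier to read, here is a minimal sketch that converts the kB fields to GiB. The function name `parse_oom_line` is just illustrative, and I am assuming the kernel's "kB" means 1024 bytes.

```python
import re

OOM_LINE = (
    "Out of memory: Killed process 54477 (python) total-vm:61672084kB, "
    "anon-rss:32929660kB, file-rss:66904kB, shmem-rss:16228kB, "
    "UID:1011 pgtables:72632kB oom_score_adj:0"
)


def parse_oom_line(line):
    """Map each 'name:<number>kB' field of an OOM-killer line to its size in GiB."""
    fields = re.findall(r"([\w-]+):(\d+)kB", line)
    # Assumption: 1 kB in the kernel log is 1024 bytes, so 1 GiB = 1024**2 kB.
    return {name: int(value) / 1024**2 for name, value in fields}


sizes = parse_oom_line(OOM_LINE)
for name, gib in sizes.items():
    print(f"{name}: {gib:.2f} GiB")
```

By this reading, anon-rss comes to roughly 31.4 GiB, so the killed Python process itself was holding that much resident memory at the moment it was killed.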
I suspect there is a problem with CUDA or torch; however, since the machine is used by multiple people, it is hard to believe we all installed the same conflicting package.
I'd appreciate any help or advice!
I reinstalled CUDA 11.8 and the corresponding torch build, but no luck.