One more thing: suppose I have a training run that may eventually use all 48 GB of the GPU memory, so I set torch.cuda.set_per_process_memory_fraction
to 1. The run starts knowing it can allocate all of that memory, but it has not actually done so yet. If another process then starts and tries to allocate a significant chunk of memory, which process will be killed?
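For clarity, this is roughly the setup I mean (a minimal sketch; the device index and the placeholder allocation are just examples, and the caching allocator should raise a CUDA out-of-memory error in this process if an allocation would exceed the cap):

```python
import torch

# Allow this process to request up to 100% of GPU 0's memory.
torch.cuda.set_per_process_memory_fraction(1.0, device=0)

# Training allocations then grow toward the full 48 GB as needed,
# e.g. a placeholder tensor of roughly 4 GB:
x = torch.empty(1024, 1024, 1024, device="cuda:0")
```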
I also didn’t mention that the OOM killer kills the process even though the process does not consume system RAM, it consumes GPU RAM. Can the OOM killer do its scoring based on GPU RAM?
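For reference, the oom_score_adj:0 that shows up in the dmesg log below can be read per process from /proc (a small sketch; the PID is just whichever training process is of interest):

```python
import os

pid = os.getpid()  # or substitute the PID of the training process

# Kernel-maintained OOM badness score and the user-set adjustment for this PID.
with open(f"/proc/{pid}/oom_score") as f:
    print("oom_score:", f.read().strip())
with open(f"/proc/{pid}/oom_score_adj") as f:
    print("oom_score_adj:", f.read().strip())
```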
Here is the dmesg log for the killed training runs.
[1049496.935479] **Out of memory**: Killed process 3718080 (python) total-vm:40401236kB, anon-rss:6162012kB, file-rss:68616kB, shmem-rss:287344kB, UID:1001 pgtables:18492kB oom_score_adj:0
[1049506.723130] **Out of memory**: Killed process 3717888 (python) total-vm:40402784kB, anon-rss:6163596kB, file-rss:68064kB, shmem-rss:313964kB, UID:1001 pgtables:18548kB oom_score_adj:0
[1049519.565244] **Out of memory**: Killed process 3717292 (python) total-vm:40445528kB, anon-rss:6206448kB, file-rss:68480kB, shmem-rss:366624kB, UID:1001 pgtables:18736kB oom_score_adj:0
[1049535.486537] **Out of memory**: Killed process 3718466 (python) total-vm:40458440kB, anon-rss:6219144kB, file-rss:68788kB, shmem-rss:343796kB, UID:1001 pgtables:18716kB oom_score_adj:0
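To double-check how much a training process holds in host RAM versus GPU RAM while it is running, something like this can be logged periodically (a rough sketch; psutil is an assumption, not part of my training code):

```python
import psutil  # assumed to be installed; only used to read host RSS
import torch

proc = psutil.Process()  # the current training process

rss_gb = proc.memory_info().rss / 1024**3                 # host RAM actually resident
gpu_alloc_gb = torch.cuda.memory_allocated() / 1024**3    # GPU memory held by tensors
gpu_reserved_gb = torch.cuda.memory_reserved() / 1024**3  # GPU memory held by the allocator cache

print(f"host RSS: {rss_gb:.2f} GB | "
      f"GPU allocated: {gpu_alloc_gb:.2f} GB | "
      f"GPU reserved: {gpu_reserved_gb:.2f} GB")
```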