Weird CUDA OOM: missing memory

Hi,
Is there any explanation for this error? It happens during validation.
Where did the 31 GB go?

RuntimeError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 31.75 GiB total capacity; 394.86 MiB already allocated; 53.00 MiB free; 424.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

PyTorch 1.10.0 installation:
pip install torch==1.10.0 -f https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
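
The message also suggests setting max_split_size_mb; for reference, this is how I understand that option would be set (just a sketch, the 128 MiB value is arbitrary, and it doesn't explain where the memory went):

import os

# Sketch only: the allocator option named in the error message is read from an
# environment variable, so set it before importing torch (i.e. before any CUDA
# allocation). The 128 MiB value is an arbitrary example, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable on purpose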

Thanks

Could you check nvidia-smi and see if other processes are using the device and are thus allocating memory?
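
Something along these lines should list any processes still holding memory on the device (a quick sketch using standard nvidia-smi query fields, wrapped in Python so it can be dropped into the script):

import subprocess

# List all compute processes currently holding GPU memory.
# pid, process_name, and used_memory are standard --query-compute-apps fields.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)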

I don't think that can be the case, unless there is an issue with Slurm that failed to clear a previous job.
The job runs on a cluster managed by Slurm.
When a GPU is requested, only free GPUs are allocated, and once allocated, only that job is allowed to run on that GPU.

I'll add an nvidia-smi call at the beginning of the code and store its output (roughly the sketch below), hopefully to catch the GPU state when this error happens again.
This type of error does not happen frequently, but I had 4 jobs in a row fail because of it.
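
Roughly what I have in mind (the log file name is just a placeholder):

import subprocess
from pathlib import Path

# Dump the full nvidia-smi output at job start so the GPU state is on record
# if the OOM shows up again.
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
Path("nvidia_smi_at_start.log").write_text(result.stdout + result.stderr)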

I had the same issue once. The admins said it could be a driver bug, but I never got the follow-up from their investigation.
This error happens only on clusters managed by Slurm.
It never happens on a standalone machine.

Thanks