Hello @ptrblck_de, I am facing the same issue.
I am trying to replicate the results from MotionBERT, a 3D human pose estimation model.
I am running the code on a high-performance computing (HPC) cluster, where I have multiple nodes with plenty of RAM available.
My jobscript.sh configuration (LSF) is:
#!/bin/sh
#BSUB -q gpua100
#BSUB -J motionBERT
#BSUB -W 23:00
#BSUB -B
#BSUB -N
#BSUB -gpu "num=1:mode=exclusive_process"
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=4GB]"
#BSUB -o logs/%J.out
#BSUB -e logs/%J.err
module load cuda/11.6
module load gcc/10.3.0-binutils-2.36.1
source /zhome/c0/a/164613/miniconda3/etc/profile.d/conda.sh
conda activate /work3/s212784/conda/env/motionbert
python train.py --config configs/pose3d/MB_train_h36m.yaml --checkpoint checkpoint/pose3d/MB_train_h36m
This gives me the following resource usage summary:
CPU time : 29.94 sec.
Max Memory : 80 MB
Average Memory : 80.00 MB
Total Requested Memory : 16384.00 MB
Delta Memory : 16304.00 MB
Max Swap : 16 MB
Max Processes : 4
Max Threads : 5
Run time : 152 sec.
Turnaround time : 119 sec.
The error I am getting is the following:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 79.15 GiB total capacity; 75.23 GiB already allocated; 159.25 MiB free; 77.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have tried to change max_split_size_mb by adding the following line to jobscript.sh (before the python run…):
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"
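To rule out the variable simply not reaching the training process (I had curly quotes in the export line at first, which the allocator would not parse), I added a quick check at the top of train.py (my own debugging snippet, not part of the MotionBERT code):

```python
import os

# Print the allocator config exactly as the Python process sees it.
# If the shell exported curly quotes, they show up here verbatim.
alloc_conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<not set>")
print(f"PYTORCH_CUDA_ALLOC_CONF = {alloc_conf}")
```

With the straight-quoted export above, this prints `max_split_size_mb:512`, so the setting does reach the process.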
Although I am still hitting the same problem, I can see that the max and average memory usage are increasing:
Resource usage summary:
CPU time : 30.63 sec.
Max Memory : 2057 MB
Average Memory : 1391.67 MB
Total Requested Memory : 16384.00 MB
Delta Memory : 14327.00 MB
Max Swap : 16 MB
Max Processes : 4
Max Threads : 8
Run time : 145 sec.
Turnaround time : 122 sec.
Nevertheless, when I increase the value too much, it drops again…
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:1024"
Resource usage summary:
CPU time : 30.63 sec.
Max Memory : 2057 MB
Average Memory : 1391.67 MB
Total Requested Memory : 16384.00 MB
Delta Memory : 14327.00 MB
Max Swap : 16 MB
Max Processes : 4
Max Threads : 8
Run time : 145 sec.
Turnaround time : 122 sec.
Therefore, I am not sure whether this is actually the problem…
Additionally (I don't know if this is relevant), before changing max_split_size_mb I thought it was a resource problem on the HPC, so I tried running the job multiple times. One thing I noticed is that the allocated memory was not always the same:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 518.00 MiB (GPU 0; 15.77 GiB total capacity; 13.82 GiB already allocated; 275.06 MiB free; 14.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
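Reading the figures out of the two messages above (an 80 GB A100 vs. what looks like a 16 GB GPU), the gap between reserved and allocated memory seems small in both runs, which as far as I understand would argue against heavy fragmentation (quick arithmetic on the reported numbers; my interpretation, not verified):

```python
# Figures copied from the two OOM messages above, in GiB.
runs = {
    "A100 (80 GB)": {"total": 79.15, "allocated": 75.23, "reserved": 77.77},
    "16 GB GPU":    {"total": 15.77, "allocated": 13.82, "reserved": 14.30},
}

for name, r in runs.items():
    # If reserved >> allocated, fragmentation would be the likely culprit;
    # here the gap is only ~2.5 GiB and ~0.5 GiB respectively.
    gap = r["reserved"] - r["allocated"]
    used_frac = r["allocated"] / r["total"]
    print(f"{name}: reserved - allocated = {gap:.2f} GiB, "
          f"allocated = {used_frac:.0%} of capacity")
```

In both runs the allocated memory nearly fills the GPU before the failing allocation, so I suspect the model/batch size simply may not fit, rather than fragmentation — but I would appreciate confirmation.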
Am I on the right path? Is fragmentation causing the problem? Is there a way to know the exact value that max_split_size_mb should have?
Thank you very much,
Best Regards,
Alex Abades