I am experiencing strange GPU memory consumption when training multiple copies of the same model in parallel. For instance, if I train three ResNet18 models in parallel with a batch size of 50, nvidia-smi reports the memory consumption of each training process as follows:
```
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51077, 1840 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51075, 1840 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51076, 1840 MiB
```
If I instead increase the batch size past 70 to, for instance, 80, this is what nvidia-smi shows:
```
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23441, 2428 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23443, 2172 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23442, 2428 MiB
```
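For reference, output in this shape can be produced with `nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader` (standard nvidia-smi query flags); the small parsing helper below is just an illustration for turning it into comparable numbers, not part of any library:

```python
# Illustrative helper: parse the comma-separated output of
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader
# into (gpu_uuid, pid, used_mib) tuples so per-process usage can be compared.

def parse_compute_apps(output: str):
    rows = []
    for line in output.strip().splitlines():
        uuid, pid, mem = (field.strip() for field in line.split(","))
        rows.append((uuid, int(pid), int(mem.split()[0])))  # "2428 MiB" -> 2428
    return rows

sample = """\
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23441, 2428 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23443, 2172 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23442, 2428 MiB
"""
print(parse_compute_apps(sample))
```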
I am wondering what causes this uneven utilization of GPU memory. I suspect it might have something to do with the GPU memory page size, or perhaps with memory swapping on the GPU. It only happens once I surpass a certain batch-size threshold; for my data that turns out to be a batch size of 70. I am running torch 1.13.1+cu117, CUDA 11.7 and NVIDIA driver 530.30.2.
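One possibly relevant mechanism: PyTorch's CUDA caching allocator hands out memory at a coarser granularity than the requested tensor sizes (allocations are commonly rounded to 512-byte blocks, and memory is reserved from the driver in larger segments; the exact thresholds are version-dependent). A rough arithmetic sketch of that rounding, using 2 MiB as an assumed segment size, shows how processes with slightly different allocation patterns can report different totals in nvidia-smi:

```python
# Illustrative sketch only, not PyTorch internals verbatim: the caching
# allocator rounds requests up to fixed granularities, so the amount
# reserved from the driver can exceed the amount the model actually needs.

def round_up(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity."""
    return ((size + granularity - 1) // granularity) * granularity

BLOCK = 512                  # assumed per-allocation rounding, in bytes
SEGMENT = 2 * 1024 * 1024    # assumed segment granularity, 2 MiB

# A batch of 80 RGB images at 224x224 in fp32:
batch_bytes = 80 * 3 * 224 * 224 * 4
print(round_up(batch_bytes, BLOCK))    # 48168960 (already a 512-byte multiple)
print(round_up(batch_bytes, SEGMENT))  # 48234496 (23 full 2 MiB segments)
```

Note also that nvidia-smi reports each process's total footprint (CUDA context plus everything the allocator has cached), which `torch.cuda.memory_allocated()` alone will not show.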
Perhaps this issue could be resolved by running my desktop machine in headless mode, SSHing into it, and trying again, since I do see a bunch of GPU processes consuming ~750 MB of the 8 GB of memory on my RTX 2080. Maybe my machine is trying to accommodate all of those processes plus the training processes at once instead of throwing an OOM exception?