Different memory consumption on GPU when training models in parallel

Hi,

I am seeing strange GPU memory consumption when training multiple copies of the same model in parallel. For instance, if I train three ResNet18 models in parallel with a batch size of 50, nvidia-smi reports the memory consumption of each training process as follows:

GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51077, 1840 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51075, 1840 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51076, 1840 MiB
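For reference, each training process is essentially running something like the sketch below (simplified, with random tensors standing in for my actual dataloader), and the per-process numbers come from a query along the lines of `nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader`:

```python
import torch
import torch.nn as nn
import torchvision

BATCH_SIZE = 50  # 50 behaves as expected; past ~70 the per-process memory diverges

device = torch.device("cuda")
model = torchvision.models.resnet18().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    # random tensors stand in for the real dataloader
    images = torch.randn(BATCH_SIZE, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (BATCH_SIZE,), device=device)

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```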

If I instead increase the batch size past 70 to, for instance, 80, this is what nvidia-smi shows:

GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23441, 2428 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23443, 2172 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23442, 2428 MiB

I am wondering what causes this uneven utilization of GPU memory. I could imagine it has something to do with the GPU memory page size or with memory swapping on the GPU? It only happens once the batch size passes a certain threshold, which for my data turns out to be 70. I am running torch 1.13.1+cu117, CUDA 11.7 and NVIDIA driver 530.30.2.
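To narrow down whether this comes from PyTorch's caching allocator or from something at the driver level, my plan is to log inside each training process what the allocator itself reports and compare that to the nvidia-smi figures (which also include the CUDA context and any reserved-but-unused blocks), roughly like this:

```python
import torch

# Memory actually handed out to live tensors by the caching allocator:
allocated = torch.cuda.memory_allocated() / 1024**2
# Memory the allocator has reserved from the driver, including cached blocks;
# nvidia-smi reports this plus the CUDA context overhead on top:
reserved = torch.cuda.memory_reserved() / 1024**2
print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")

# Detailed breakdown of the allocator state for this process:
print(torch.cuda.memory_summary())
```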

Perhaps this issue could be resolved by running my desktop machine in headless mode, SSHing into it, and trying again, since I do see a bunch of GPU processes consuming ~750 MB of the 8 GB of memory on my RTX 2080. Maybe my machine is trying to accommodate all of those processes plus the training processes at once instead of throwing an OOM exception?
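If the leftover graphics processes turn out to be the culprit, I suppose I could also cap each training process explicitly so that it OOMs instead of quietly ending up with a different share, e.g. with torch.cuda.set_per_process_memory_fraction (the fraction below is just a guess for three processes on an 8 GB card):

```python
import torch

# Rough guess: give each of the three training processes just under a third
# of the 8 GB card, leaving headroom for the desktop's graphics processes.
# Allocations beyond this fraction raise an out-of-memory error instead of
# the allocator simply taking whatever happens to be left.
torch.cuda.set_per_process_memory_fraction(0.28, device=0)
```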

Running my machine headless, I can verify that the difference is due to the graphics processes on the GPU. It makes sense, although I am surprised that, rather than each process strictly requiring the same amount of memory and OOM'ing if that is not possible, some memory swapping appears to be happening. I suppose this behavior is outside PyTorch's control, but it is an interesting observation nonetheless.
