Is there a way to train independent models in parallel using the same dataloader?

Hi Roman,

I am working on something similar - using the same training data to train multiple (identical) models in parallel. Regarding your question about the ~1 GB GPU memory allocation per process: it is most likely the CUDA context overhead. If you start an interactive Python session, import torch, create a tensor and send it to your device, nvidia-smi will show you roughly what a CUDA context costs. In my case it hovers around ~500-600 MB.
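As a minimal sketch of that check (assuming a single GPU visible as cuda:0, which is just an example device):

```python
# Run this in an interactive session with nvidia-smi open in another terminal.
import torch

x = torch.ones(1, device="cuda:0")    # the first CUDA op initializes the CUDA context
print(torch.cuda.memory_allocated())  # negligible: just the tensor's own allocation
# nvidia-smi will nevertheless report a few hundred MB for this Python process;
# the difference is the CUDA context (driver state, kernels), not your data.
```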

Some pointers on the CUDA context overhead:

Secondly, I wanted to ask whether you have encountered discrepancies in GPU memory consumption between the training processes. For instance, I am training multiple ResNet18 models in parallel. With batch sizes below a certain threshold, each training instance uses the same amount of memory; once the batch size exceeds that threshold, the instances start using different amounts of GPU memory. Did you notice anything similar? This is what my nvidia-smi looks like for three ResNet18 models being trained in parallel with a batch size above the threshold:

GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23441, 2428 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23443, 2172 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 23442, 2428 MiB

And if I train the same three models with a smaller batch size, I instead see identical memory consumption for each training process:

GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51077, 1840 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51075, 1840 MiB
GPU-2022369e-2f16-0362-7dc3-ea36ded90774, 51076, 1840 MiB
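In case it helps with comparing runs like the ones above, here is a minimal sketch (my own, not from the original discussion) of how I cross-check the numbers from inside each training process. It assumes a single CUDA device: torch.cuda.memory_allocated() counts live tensors, torch.cuda.memory_reserved() counts what the caching allocator is holding, and the remaining gap to nvidia-smi's per-process figure is roughly the CUDA context mentioned earlier.

```python
import torch

def log_gpu_mem(tag: str) -> None:
    """Print this process's PyTorch-side GPU memory usage in MiB."""
    allocated = torch.cuda.memory_allocated() / 2**20  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20    # memory held by the caching allocator
    print(f"[{tag}] allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# e.g. call log_gpu_mem("after epoch 1") in each training process and compare
# the printed values against the per-PID numbers reported by nvidia-smi
```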

Update: I fixed my memory consumption issue in another post I made (I can only link to two URLs in a post as a new user).
