Train multiple models from the same dataloader on a single device?

ldorigo · June 2, 2022, 4:08pm

Hi,

I have a model that by itself takes relatively little GPU memory (around 1 GB, 1/8th of what I have available), but the dataset is comparatively large (around 40 GB, which fits in RAM). Since loading the data takes a long time, I load it lazily and keep everything in RAM, meaning the first epoch takes around 30 minutes to run, but every epoch after that takes less than one minute.

I’m tuning hyperparameters and thus need to train many variants of the model. My GPU could easily train 7 or 8 models in parallel, but since I only have 96 GB of RAM, I can only fit the dataset twice in memory, meaning I can only train two models at a time.

Is it possible to use a single Dataset/DataLoader for multiple models so that I don’t need duplicate data in memory, and if so, how?

Thanks!

ptrblck · June 2, 2022, 11:40pm

It should be possible if you run the different models in different CUDA streams.
However, memory is not the only limit for this use case: you would also need free compute resources to execute the workloads in parallel on the same device. I.e., even if a single model uses only 1/8 of the available memory, it could still occupy all compute resources (SMs) and thus block the other models, so you would need to profile the actual workload.
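
To make the stream suggestion concrete, here is a minimal sketch of one way it could look: a single process, one DataLoader, and each batch fed to every model, with each model's forward/backward/step issued on its own CUDA stream. The helper `train_shared`, the toy linear models, the tensor shapes, and the learning rates are all hypothetical and only there for illustration; the code falls back to plain sequential execution on CPU. Whether the streams actually overlap on the GPU depends on free SMs, so treat this as a starting point for profiling rather than a guaranteed speedup.

```python
import contextlib

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_shared(models, optimizers, loader, device, epochs=1):
    """Feed every batch from one shared DataLoader to all models (hypothetical helper)."""
    use_cuda = device.type == "cuda"
    # One side stream per model on GPU; no streams on CPU.
    streams = [torch.cuda.Stream(device=device) if use_cuda else None for _ in models]
    loss_fn = nn.MSELoss()
    last_losses = [None] * len(models)
    for _ in range(epochs):
        for x, y in loader:
            # The batch is copied to the device once and reused by all models.
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)
            for i, (model, opt, stream) in enumerate(zip(models, optimizers, streams)):
                if use_cuda:
                    # The side stream must wait for the copy issued on the current stream.
                    stream.wait_stream(torch.cuda.current_stream(device))
                    ctx = torch.cuda.stream(stream)
                else:
                    ctx = contextlib.nullcontext()
                with ctx:
                    opt.zero_grad(set_to_none=True)
                    loss = loss_fn(model(x), y)
                    loss.backward()
                    opt.step()
                    last_losses[i] = loss.detach()
            if use_cuda:
                # Wait for all streams before x and y are released and their memory reused.
                torch.cuda.synchronize(device)
    return [float(loss) for loss in last_losses]


# Toy stand-in for the real in-RAM dataset (shapes are made up for the example).
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(256, 4)
y = x @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True,
                    pin_memory=(device.type == "cuda"))

# One model and optimizer per hyperparameter setting (here: two learning rates).
learning_rates = [0.05, 0.01]
models = [nn.Linear(4, 1).to(device) for _ in learning_rates]
optimizers = [torch.optim.SGD(m.parameters(), lr=lr)
              for m, lr in zip(models, learning_rates)]

losses = train_shared(models, optimizers, loader, device, epochs=5)
print(losses)
```

Because the dataset lives in a single process, it is held in RAM only once no matter how many models are trained. The per-batch `torch.cuda.synchronize` keeps the sketch simple and safe; it also means the slowest model sets the pace for all of them.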