Train multiple independent models on a single GPU


I have what seems to be a very basic issue but couldn’t find a solution for it.

I have a 24GB GPU, and I am trying to train two independent models; each uses only 2GB of memory. When I train the two models simultaneously on that GPU, the training speed of each drops to around 1/5 of the speed of training one of them alone.

What could be the issue?

You might be creating new bottlenecks in your code, e.g. if both applications are trying to load data while your storage isn't fast enough to feed them both.
You could profile the scripts to see where the bottleneck is created.
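A minimal sketch of how you could profile one of the training scripts with Python's built-in `cProfile` to see whether time is spent in data loading or in the actual training step. The `load_batch`, `train_step`, and `train_one_epoch` functions here are hypothetical stand-ins for your real code:

```python
import cProfile
import pstats
import time

def load_batch():
    # Stand-in for a data-loading step; replace with your real DataLoader call.
    time.sleep(0.001)
    return [0.0] * 1024

def train_step(batch):
    # Stand-in for the forward/backward pass.
    return sum(batch)

def train_one_epoch(num_batches=100):
    for _ in range(num_batches):
        batch = load_batch()
        train_step(batch)

profiler = cProfile.Profile()
profiler.enable()
train_one_epoch()
profiler.disable()

# Show the 5 most expensive calls by cumulative time; if data loading
# dominates here, the GPU is likely starved rather than oversubscribed.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

If the data-loading functions dominate cumulative time while both scripts run, the disk (not the GPU) is the shared bottleneck.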

Also, to parallelize CUDA workloads on the GPU, you would need to make sure enough compute resources are free. E.g. if one script uses all SMs in, say, a matmul operation, the other script's kernels won't be able to run concurrently, so you shouldn't expect perfect parallelization (even without other bottlenecks), but it of course depends on the actual use case.
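To build intuition for this kind of contention, here is a rough CPU analogy (an assumption for illustration, not a GPU measurement): two compute-bound workers sharing the same fixed pool of compute units each run slower than one worker alone, just as two kernels competing for the same SMs do. The function names are hypothetical:

```python
import multiprocessing as mp
import time

def busy_work(n):
    # CPU-bound stand-in for a kernel that saturates the compute units.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_run(num_procs, n=2_000_000):
    # Run `num_procs` copies of the workload concurrently and time them.
    start = time.perf_counter()
    with mp.Pool(num_procs) as pool:
        pool.map(busy_work, [n] * num_procs)
    return time.perf_counter() - start

if __name__ == "__main__":
    solo = timed_run(1)
    both = timed_run(2)
    # On a machine where the workers exhaust the available cores, the
    # two-worker run takes noticeably longer per worker than the solo run.
    print(f"1 worker: {solo:.2f}s, 2 workers: {both:.2f}s")
```

The same principle applies on the GPU: concurrency only helps if the first workload leaves SMs idle for the second one to use.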

Maybe you are not using an NVIDIA GPU, in which case this may be true.

What storage do you recommend?

I would use an SSD at least and avoid trying to load data from a spinning disk.
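If you want to sanity-check whether your storage can feed both training scripts, a quick sequential-read benchmark is a rough proxy. This sketch (stdlib only; the helper name is my own) writes a temporary file and measures read throughput:

```python
import os
import tempfile
import time

def read_throughput_mb_s(path, chunk_size=1 << 20):
    # Sequentially read a file in 1 MB chunks and report MB/s.
    # Note: the OS page cache can inflate this for recently written files,
    # so treat the number as an upper bound on real disk throughput.
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / max(elapsed, 1e-9)

# Quick self-contained check against a 16 MB temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(16 * 1024 * 1024))
    tmp_path = tmp.name

print(f"{read_throughput_mb_s(tmp_path):.0f} MB/s")
os.remove(tmp_path)
```

Compare the result against the combined data rate your two training loops need; a spinning disk serving two readers at once also suffers from seek overhead, which this sequential test won't capture.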