Train multiple independent models on a single GPU


I have what seems to be a very basic issue but couldn’t find a solution for it.

I have a 24GB GPU, and I am trying to train two independent models; each uses only 2GB of memory. When I train the two models simultaneously on that GPU, the training speed of each drops to around 1/5 of the speed of training one of them alone.

What could be the issue?

You might be creating new bottlenecks in your code, e.g. if both applications are trying to load data while your storage isn't fast enough to feed them both.
You could profile the scripts to see where the bottleneck is created.
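A minimal sketch of how you could profile one of the training scripts with Python's built-in `cProfile` to see whether time is spent in data loading or in the actual training step. The `load_batch`, `train_step`, and `train_one_epoch` functions here are hypothetical stand-ins for your real code:

```python
import cProfile
import pstats
import time

def load_batch():
    # Stand-in for a data-loading step; replace with your real DataLoader call.
    time.sleep(0.001)
    return [0.0] * 1024

def train_step(batch):
    # Stand-in for the forward/backward pass.
    return sum(batch)

def train_one_epoch(num_batches=100):
    for _ in range(num_batches):
        batch = load_batch()
        train_step(batch)

profiler = cProfile.Profile()
profiler.enable()
train_one_epoch()
profiler.disable()

# Show the 5 most expensive calls by cumulative time; if data loading
# dominates here, the GPU is likely starved rather than oversubscribed.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

If the data-loading functions dominate cumulative time while both scripts run, the disk (not the GPU) is the shared bottleneck.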

Also, to parallelize CUDA workloads on the GPU, you would need to make sure enough compute resources are free. E.g. if one script uses all SMs in, say, a matmul operation, the other script's kernels won't be able to run concurrently, so you shouldn't expect perfect parallelization (even without other bottlenecks), but it of course depends on the actual use case.
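To build intuition for this kind of contention, here is a rough CPU analogy (an assumption for illustration, not a GPU measurement): two compute-bound workers sharing the same fixed pool of compute units each run slower than one worker alone, just as two kernels competing for the same SMs do. The function names are hypothetical:

```python
import multiprocessing as mp
import time

def busy_work(n):
    # CPU-bound stand-in for a kernel that saturates the compute units.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed_run(num_procs, n=2_000_000):
    # Run `num_procs` copies of the workload concurrently and time them.
    start = time.perf_counter()
    with mp.Pool(num_procs) as pool:
        pool.map(busy_work, [n] * num_procs)
    return time.perf_counter() - start

if __name__ == "__main__":
    solo = timed_run(1)
    both = timed_run(2)
    # On a machine where the workers exhaust the available cores, the
    # two-worker run takes noticeably longer per worker than the solo run.
    print(f"1 worker: {solo:.2f}s, 2 workers: {both:.2f}s")
```

The same principle applies on the GPU: concurrency only helps if the first workload leaves SMs idle for the second one to use.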

Maybe you are not using an NVIDIA GPU, in which case this may be true.

What storage do you recommend?

I would use an SSD at least and avoid trying to load data from a spinning disk.
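If you want to sanity-check whether your storage can feed both training scripts, a quick sequential-read benchmark is a rough proxy. This sketch (stdlib only; the helper name is my own) writes a temporary file and measures read throughput:

```python
import os
import tempfile
import time

def read_throughput_mb_s(path, chunk_size=1 << 20):
    # Sequentially read a file in 1 MB chunks and report MB/s.
    # Note: the OS page cache can inflate this for recently written files,
    # so treat the number as an upper bound on real disk throughput.
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / max(elapsed, 1e-9)

# Quick self-contained check against a 16 MB temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(16 * 1024 * 1024))
    tmp_path = tmp.name

print(f"{read_throughput_mb_s(tmp_path):.0f} MB/s")
os.remove(tmp_path)
```

Compare the result against the combined data rate your two training loops need; a spinning disk serving two readers at once also suffers from seek overhead, which this sequential test won't capture.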