I am facing a problem when executing two Python scripts (the same script with different arguments) on different GPUs.
Basically, when I run the two scripts at the same time, I get a “RuntimeError: CUDA error: out of memory”. However, if I run only a single instance, the script works fine.
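For context, a minimal sketch of how each instance could be pinned to its own GPU by setting `CUDA_VISIBLE_DEVICES` before any CUDA library is imported (the helper function and the way the index is passed are illustrative assumptions, not the poster's actual code):

```python
import os

def pin_to_gpu(gpu_index: int) -> None:
    # Must run before importing torch (or any CUDA-using library),
    # otherwise the CUDA context may already be bound to a device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)

# First instance would use GPU 0, the second instance GPU 1.
pin_to_gpu(0)
```

If both processes accidentally land on the same device (e.g. both default to GPU 0), an out-of-memory error when running two instances is exactly the symptom you would see.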
Your CPU workload might be the bottleneck, e.g. if both processes need to load and process data before feeding it into the model. To verify this, you could profile the processes in isolation and while both are running, and compare the timelines, e.g. via Nsight Systems.
I will check the processes via Nsight Systems. I have one more question: could these processes affect each other (for example, while computing the derivatives)?
I’m not sure what “affect each other” means in this context, so could you explain it a bit more?
If you are concerned about numerical issues in one process caused by the other training run, then I would claim it should not happen; otherwise you would also have to be concerned about e.g. Firefox running in another process changing your training run.
Yes, I was concerned about the training processes affecting each other, but I agree with your answer.
About Nsight Systems, I have not been able to check it yet. My server has two GPUs, and I have to schedule a job to run a process. Since I use both GPUs to train the model, it is not possible to execute another job for Nsight Systems.
However, I am preprocessing the data before feeding it to the model. If this CPU-side processing is the bottleneck, is there something I can do (e.g., increasing the number of workers in the DataLoader)?
You could try to increase the number of workers, but note that it could also slow down your code if too many processes are open, or if the bottleneck is reading the data from the SSD (as more workers cannot increase the disk bandwidth).
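A minimal sketch of what this looks like, using a synthetic dataset whose `__getitem__` stands in for your CPU preprocessing (the dataset and its transform are placeholder assumptions):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticDataset(Dataset):
    """Toy dataset; __getitem__ is where CPU-side preprocessing runs."""

    def __init__(self, n: int = 64):
        self.data = torch.randn(n, 8)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Stand-in for expensive CPU preprocessing (decoding, augmentation, ...).
        return self.data[idx] * 2.0

# num_workers > 0 runs __getitem__ in background processes so the GPU is
# not starved waiting for data. Tune it: too many workers can oversubscribe
# the CPU, and no number of workers can speed up a saturated disk.
loader = DataLoader(SyntheticDataset(), batch_size=16, num_workers=2)
batches = [b for b in loader]
```

Profiling with and without workers (as suggested above) is the reliable way to see whether the data pipeline or the disk is the actual bottleneck.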