Running Python scripts on different GPUs raises an out-of-memory error


I am facing a problem when executing two Python scripts (same script but different arguments) on different GPUs.

When I run the two scripts at the same time, I get a “RuntimeError: CUDA error: out of memory” message. However, if I run only a single instance, the script works.

Please, could someone help me with this problem?

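Your scripts might each initialize a CUDA context on the default device (GPU 0), even if they do their work on another GPU. That context is unnecessary and takes up memory on GPU 0 for both processes.
Try running the scripts via: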

CUDA_VISIBLE_DEVICES="0" python script1.py args
CUDA_VISIBLE_DEVICES="1" python script2.py args

and use cuda:0 inside both scripts. Each process then sees only the one GPU it was given, and that GPU is always indexed as cuda:0.
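If you prefer to start both runs from a single launcher instead of two shell commands, the same idea can be sketched in Python. This is only an illustration: `env_for_gpu` and `launch` are hypothetical helper names, and `script1.py`, `script2.py`, and `args` are the placeholders from the commands above.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index):
    """Return a copy of the current environment that exposes a single GPU."""
    env = os.environ.copy()
    # The child process sees only this physical GPU, and it shows up as cuda:0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(script, args, gpu_index):
    """Start `python script args` restricted to the given physical GPU."""
    cmd = [sys.executable, script, *args]
    return subprocess.Popen(cmd, env=env_for_gpu(gpu_index))
```

Calling `launch("script1.py", ["args"], 0)` and `launch("script2.py", ["args"], 1)` and then `wait()` on both returned processes mirrors the two commands. Inside each script the device would still be created as `torch.device("cuda:0")`.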


Thanks, it worked!

However, when I was running only one script, it took 8 seconds per epoch. Now that I am running two processes, an epoch takes 12 seconds.

Do you know why?

Your CPU workload might be the bottleneck, e.g. if both processes need to load and preprocess data before feeding it into the model. To verify this, you could profile one process in isolation and then again while both are running, and compare the timelines, e.g. via Nsight Systems.
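As a rough first check before reaching for a profiler, you could also split each epoch into time spent waiting for data and time spent in the training step. The sketch below uses only the standard library; `split_epoch_time` and `step_fn` are hypothetical names, and `batches` can be any iterable, such as a DataLoader.

```python
import time


def split_epoch_time(batches, step_fn):
    """Return (seconds waiting for data, seconds spent in step_fn) for one pass."""
    wait_s = 0.0
    step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here belongs to the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward pass, backward pass, optimizer step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    return wait_s, step_s
```

CUDA kernels are launched asynchronously, so for meaningful numbers on a GPU `step_fn` should end with `torch.cuda.synchronize()`. If the waiting time grows noticeably once the second process is started, the input pipeline is the likely culprit.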

Thanks again for the answer!

I will check the processes via Nsight Systems. I have one more question: could these processes affect each other (for example, while computing the derivatives)?

I’m not sure what “affect” each other means in this context, so could you explain it a bit more?
If you are concerned about numerical issues in one process caused by the other training run, then I would claim that this should not happen. Otherwise you would also have to be concerned about e.g. Firefox running in another process and changing your training run.


Yes, I was concerned about other training processes. However, I agree with your answer.

As for Nsight Systems, I have not been able to check it yet. My server has two GPUs, and I have to schedule a job to run a process. If both GPUs are busy training the model, I cannot schedule another job for Nsight Systems.

I do preprocess the data before feeding it to the model, though. If this CPU-side preprocessing is the bottleneck, is there something I can do (e.g., increasing the number of workers in the DataLoader)?

You could try to increase the number of workers, but note that this could also slow down your code if too many processes are open or if the bottleneck is reading the data from the SSD (more workers cannot increase the disk bandwidth).
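Since the best worker count depends on your machine, one option is to simply time a full pass over the data for a few candidate values. The following is a minimal sketch using only the standard library; `time_worker_counts` and `make_loader` are hypothetical names, and the dataset and batch size in the comment are placeholders.

```python
import time


def time_worker_counts(make_loader, worker_counts):
    """Time one full pass over the loader for each candidate worker count."""
    results = {}
    for n in worker_counts:
        loader = make_loader(n)
        start = time.perf_counter()
        for _ in loader:
            pass  # only iterate; no model work, so this isolates data loading
        results[n] = time.perf_counter() - start
    return results


# With PyTorch, this could be called as:
# time_worker_counts(
#     lambda n: DataLoader(dataset, batch_size=64, num_workers=n),
#     [0, 2, 4, 8],
# )
```

The measured time includes worker start-up, so short datasets may favor fewer workers. It is also worth running this while the other training process is active, because both runs compete for the same CPU cores and disk.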