Efficient way to run independent jobs on multiple GPU machine

Hello, it is unclear to me what the efficient way is to run independent jobs (e.g., the many runs of a hyper-parameter search) on a machine with multiple GPUs. I am not sure how PyTorch handles multiple GPUs, but I can see three approaches, each of which might be better depending on how that is handled:

  1. Run the jobs one by one, serially, on the machine.
  2. Run multiple jobs in parallel on the machine, limiting each to a different GPU with CUDA_VISIBLE_DEVICES.
  3. Run multiple jobs in parallel on the machine, without limiting runs to specific GPUs.

Which of these options would be best? Or perhaps something else entirely? (A rough sketch of what I mean by option 2 is below.)
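For concreteness, this is roughly what option 2 could look like: one process per run, each pinned to its own GPU via CUDA_VISIBLE_DEVICES. The script name `train.py`, the hyper-parameter values, and the GPU count are placeholders, not something anyone in the thread actually ran.

```python
import os
import subprocess

learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]  # placeholder hyper-parameter grid
num_gpus = 4                               # placeholder GPU count

procs = []
for i, lr in enumerate(learning_rates):
    # Each child process only sees the one GPU assigned to it.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % num_gpus))
    procs.append(subprocess.Popen(["python", "train.py", f"--lr={lr}"], env=env))

for p in procs:
    p.wait()

# Note: this naive version launches everything at once, so it only works when the
# number of runs does not exceed what the GPUs can hold concurrently.
```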

Hi, does anyone have any ideas?

Option 1 seems bad because you’re not exploiting the potential parallelism.

Option 3 is interesting. I’m not sure how PyTorch picks which GPU to use if you don’t specify one, but I wouldn’t be surprised if everything ran on GPU 0 or something.

Option 2 (using CUDA_VISIBLE_DEVICES) looks good to me. PyTorch has a set_device API (torch.cuda.set_device) where you can tell it which GPU to put each tensor on, but this code seems to recommend just using CUDA_VISIBLE_DEVICES.
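For reference, a minimal sketch of the two approaches (the GPU index 1 is just an example, not something from the thread):

```python
import os
import torch

# Approach A: restrict this process to one physical GPU before any CUDA call.
# That GPU then shows up inside the process as cuda:0. Usually this variable is
# set in the shell that launches the job rather than in the script itself.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
x = torch.zeros(8, device="cuda")  # lands on physical GPU 1

# Approach B: keep every GPU visible and select one explicitly in code.
# torch.cuda.set_device(1)
# y = torch.zeros(8, device="cuda:1")
```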


I tried option 3 with multiprocessing.Pool() + pool.apply_async() + gpu_list = multiprocessing.Manager().list([0]*10). The idea is to use the shared list to track GPU usage, e.g. each GPU runs at most two jobs: add 1 to a GPU’s entry in gpu_list when a job starts on it and subtract 1 when it finishes. I got it working after solving three major problems (a sketch of the resulting pattern follows the list):

1. At the start of each job, all the workers read gpu_list at the same time, so more than two jobs may end up allocated to the same GPU.
SOLUTION: time.sleep(0.1) between submitting jobs with pool.apply_async().
Or BETTER SOLUTION: protect gpu_list with a lock (e.g. a Manager().Lock(), since a plain multiprocessing.Lock() cannot be passed through apply_async()), and hold it whenever you increment or decrement an entry.

2. The pool starts a new process immediately after an old one finishes, but the old process’s GPU memory is released much more slowly, causing out-of-memory errors for the new process.
SOLUTION: time.sleep(30) to wait before reading gpu_list to select a spare GPU.
Or BETTER SOLUTION: del network followed by torch.cuda.empty_cache() at the end of the worker function, to release the GPU memory manually.

3. Strangely, in my code I can run three jobs (3 GB each) from separate main() scripts at the same time without any error. With multiprocessing, however, some processes die from out-of-memory: checking GPU usage, one of my jobs suddenly takes 10 GB and kills the other two. This never happens when I run the three scripts separately. I still don’t know why.
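This is not the poster’s actual code, but a minimal sketch of the pattern described above: a Manager list plus lock for GPU slots (problem 1), and explicit cleanup at the end of each worker (problem 2). The GPU count, the jobs-per-GPU limit, and the build_and_train() stub are placeholders.

```python
import multiprocessing as mp

import torch

NUM_GPUS = 4        # placeholder: number of GPUs in the machine
JOBS_PER_GPU = 2    # placeholder: assumed capacity per GPU

def build_and_train(job_id, device):
    # Stand-in for the real training code of one hyper-parameter run.
    model = torch.nn.Linear(512, 512).to(device)
    x = torch.randn(64, 512, device=device)
    for _ in range(100):
        model(x)
    return model

def run_job(job_id, gpu_usage, lock):
    # Pick the least-loaded GPU under the lock, so two workers cannot grab
    # the same slot at the same time (problem 1).
    with lock:
        gpu_id = min(range(NUM_GPUS), key=lambda g: gpu_usage[g])
        gpu_usage[gpu_id] += 1
    network = None
    try:
        network = build_and_train(job_id, torch.device(f"cuda:{gpu_id}"))
    finally:
        # Release memory eagerly so the next job on this GPU does not OOM (problem 2).
        del network
        torch.cuda.empty_cache()
        with lock:
            gpu_usage[gpu_id] -= 1

if __name__ == "__main__":
    # CUDA is deliberately touched only inside the workers, never in the parent
    # process before the pool is created.
    manager = mp.Manager()
    gpu_usage = manager.list([0] * NUM_GPUS)
    lock = manager.Lock()  # a Manager lock can be passed through apply_async
    with mp.Pool(NUM_GPUS * JOBS_PER_GPU) as pool:
        results = [pool.apply_async(run_job, (i, gpu_usage, lock)) for i in range(16)]
        for r in results:
            r.get()  # re-raises worker exceptions instead of dropping them silently
```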

Anyway, it’s very tricky to wrap GPU jobs with multiprocessing. The errors are strange, and most of the time there is no error at all: the pool finishes its work even if some of the jobs fail, so to debug you need to use pool.apply (without a callback function) instead of fire-and-forget apply_async. By the way, don’t call torch.cuda.is_available() or anything similar even in main(); strange errors occur (likely because initializing CUDA in the parent process before the pool forks breaks CUDA in the children).
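To illustrate that debugging point, here is a tiny self-contained example (not from the thread): pool.apply_async() hides worker exceptions unless you ask for the result, while pool.apply() raises them immediately.

```python
import multiprocessing as mp

def worker(x):
    raise RuntimeError(f"job {x} failed")

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # apply_async returns immediately and swallows the failure
        # unless you call .get() on the result object.
        res = pool.apply_async(worker, (0,))
        try:
            res.get()
        except RuntimeError as e:
            print("caught from apply_async result:", e)

        # pool.apply is synchronous and re-raises the worker's exception
        # right away, which makes it the easier mode to debug with.
        try:
            pool.apply(worker, (1,))
        except RuntimeError as e:
            print("caught from pool.apply:", e)
```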


Thank you for such a detailed report! Saved me hours of debugging.