I am trying to train n models. Each model has the same structure and the same inputs, but learns a different output. Since the models are independent of one another, I would like to parallelize their training. My attempt so far is to put the models on different GPUs, done through:
import torch

def try_gpu(i=0):  #@save
    """Return gpu(i) if it exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

# LinearNet, net_num, and num_gpu are defined elsewhere in my code.
model = [LinearNet(net_num).double().to(try_gpu(i % num_gpu)) for i in range(net_num)]
So for example, if I have 6 models and num_gpu = 2, then model[0] is on cuda:0, model[1] is on cuda:1, model[2] is on cuda:0, and so on, alternating between the two GPUs.
I have also made sure to put the inputs on the different GPUs as well. However, I'm stuck at actually making this parallel: I'm still just looping over the models and running them sequentially. How would I actually run model[0] and model[1] in parallel?
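For reference, here is roughly what my current sequential loop looks like (inputs, targets, criterion, and optimizers are simplified placeholders):

# Current approach: the models train one after another.
for i, net in enumerate(model):
    device = try_gpu(i % num_gpu)
    x = inputs.to(device)       # the same inputs for every model
    y = targets[i].to(device)   # each model learns a different output
    optimizers[i].zero_grad()
    loss = criterion(net(x), y)
    loss.backward()
    optimizers[i].step()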
In your case, I'd propose giving Hydra parallel multirun a try (Joblib Launcher plugin | Hydra), or simply parameterizing your script with command-line arguments and launching multiple Python script instances via Bash scripting.
I'm trying to avoid multiple script launches, because I'm trying to avoid writing data to disk and then reading it back in. Currently, I have another script that creates the data, which is then immediately fed into the neural-network script. If I do multiple launches with Bash, I either have to keep rerunning the data-creation script or keep reading the data back in, neither of which seems optimal. As things scale up in terms of data and the number of networks being trained, reading the data in or recalculating it would become slower and slower, which is why I'm trying to avoid that.
If I decouple the data-creation script from the model, then I must write all of that data somewhere, and every time I launch a new model, I have to read all of it back in. Once our data reaches the millions, this becomes rather inefficient, so I'd rather keep them coupled: the data is passed directly to the model script, and the model script itself is parallelized to run the n models on different GPUs.
In that case, you should probably look into in-memory datasets and the Joblib launcher to run the model training in parallel. Once you have created an in-memory dataset, you can instantiate N dataloaders that are passed to the N different models at the start of the job.
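A rough sketch of that idea, assuming the data already lives in memory as tensors (train_x, train_y_list, and the optimizer/loss choices here are placeholders; the threading backend avoids pickling CUDA models across processes):

import torch
from torch.utils.data import TensorDataset, DataLoader
from joblib import Parallel, delayed

def train_one(net, loader, device, epochs=1):
    # Each job trains one model on its own device.
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()
    return net

# One in-memory dataset/dataloader per target, all sharing the same inputs.
datasets = [TensorDataset(train_x, y) for y in train_y_list]
loaders = [DataLoader(ds, batch_size=64, shuffle=True) for ds in datasets]
devices = [try_gpu(i % num_gpu) for i in range(net_num)]

trained = Parallel(n_jobs=net_num, backend="threading")(
    delayed(train_one)(model[i], loaders[i], devices[i]) for i in range(net_num)
)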
Your workflow can launch the model execution on different devices, and they will be executed asynchronously unless you are synchronizing explicitly (via torch.cuda.synchronize()) or implicitly, e.g. via tensor.nonzero(), tensor.item(), etc.
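As a small illustration (the layer sizes and the two-GPU setup here are just placeholders), the forward calls below return before the GPUs finish, so both models can run concurrently:

import torch

# Placeholder setup: two identical models on two devices (assumes >= 2 GPUs).
nets = [torch.nn.Linear(1024, 1024).to(f'cuda:{i}') for i in range(2)]
xs = [torch.randn(64, 1024, device=f'cuda:{i}') for i in range(2)]

outs = []
for net, x in zip(nets, xs):
    outs.append(net(x))  # kernel launches are asynchronous; this loop doesn't wait

# Only here does the CPU block until both GPUs have finished.
for i in range(2):
    torch.cuda.synchronize(i)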
If you are not seeing overlapping execution in e.g. Nsight Systems, your CPU (and the overall dispatching) might be too slow compared to the GPU execution time.
The CPU will launch the model executions sequentially, but as long as no model synchronizes internally, both will be executed in parallel on the two GPUs.
You could verify this with a profiler, as mentioned before. If you do so and don't see an overlap, your training might be CPU-bound: the CPU might not be fast enough at scheduling the work compared to the runtime of the models on the GPU.
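For example, a minimal check with torch.profiler (reusing the placeholder nets and xs from the earlier sketch); opening the exported trace in chrome://tracing shows whether the kernels on cuda:0 and cuda:1 overlap:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    outs = [net(x) for net, x in zip(nets, xs)]  # launches on both devices
    for i in range(2):
        torch.cuda.synchronize(i)

prof.export_chrome_trace("trace.json")  # inspect the CUDA streams for overlap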