Distributed processing

Hello, I am trying to train a machine learning model in Python using the PyTorch package.
So far I have used a single machine with one GPU, but it is not enough to train the model in a reasonable time.
I have got access to Myriad, so I can try to parallelize the process and use the power of a supercomputer.
To parallelize the training I have been using the “torch.multiprocessing” module: I load the dataset into a “DataLoader” object
and I use the Process() method.
Here I am a bit confused, because it looks like there are two possible kinds of parallelization to perform:

  1. I can set the number of data-loading worker processes in the DataLoader class by passing “num_workers” as input
  2. I can spawn multiple processes with the Process() method and join them after the job.
Given that my goal is to use multiple GPUs with multiple cores, are 1) and 2) the right ways, or are these parallelizations still related to the CPUs?
Should I look into distributed processing instead?
I ask because I have not seen any speed improvement so far.
Any kind of help would be really appreciated, thanks!!
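For reference, a minimal sketch of the two approaches described in the question (dataset shapes and function names are made up for illustration). Note that both mechanisms are CPU-side: “num_workers” only parallelizes batch loading, and plain Process() workers have no gradient synchronization between them.

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 32 samples of 10 features each (made-up shapes).
dataset = TensorDataset(torch.randn(32, 10), torch.randn(32, 1))

# Option 1: num_workers spawns CPU worker processes that only load and
# prepare batches in the background; it does not parallelize training itself.
loader = DataLoader(dataset, batch_size=8, num_workers=2)

# Option 2: mp.Process runs a target function in separate CPU processes;
# without explicit gradient synchronization each worker trains independently.
def work(rank):
    n = sum(x.shape[0] for x, _ in loader)
    print(f"worker {rank} iterated over {n} samples")

if __name__ == "__main__":
    procs = [mp.Process(target=work, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```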

I don’t know the specs of your new system, but one possible and widely used approach would be to apply a data-parallel training strategy via DistributedDataParallel, which uses a single process per GPU.
This tutorial might be a good starter.
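To make the structure concrete, here is a minimal DistributedDataParallel sketch. To keep it runnable anywhere it uses the CPU-only “gloo” backend with a single process; on a multi-GPU cluster you would instead launch one process per GPU (e.g. with torchrun), use the “nccl” backend, and pass `device_ids=[local_rank]`. The model and data shapes are made up for illustration.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main(rank=0, world_size=1):
    # Rendezvous info for the process group (single local process here).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each process wraps its own model replica; DDP keeps them in sync.
    model = DDP(nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    # DistributedSampler gives each process a disjoint shard of the data;
    # num_workers would only add CPU data-loading workers per process.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()  # gradients are all-reduced across processes here
        opt.step()

    dist.destroy_process_group()
    return loss.item()

if __name__ == "__main__":
    main()
```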