DDP Tutorial Questions

I am attempting to follow this tutorial to train a neural network over multiple GPUs on the same Windows machine (using gloo), and I do not fully understand why the example code is structured the way it is. Does anyone know any other good tutorials on how to do this? I'm trying to utilize all the GPUs on my computer and do not fully understand how to write a full training and testing loop using DPP.

I’ve read several tutorials, but still have a few questions:

Is it possible to use DPP in a Jupyter notebook?

Why do all the functions take "rank" as an argument, but never define this value anywhere?

If you have your own boilerplate method for training a neural network in a loop, how do you incorporate these techniques? Do you only parallelize the model and let the computer determine where to send it, or do you need to parallelize everything as outlined in this tutorial?

How would you "setup" your environment to access GPUs on the same computer? How do you know what port to use? Also, how is the "init_method" file used on Windows computers? How would you pick one?

Even though PyTorch does not recommend using data parallel over DPP, is there a way to use it with a gloo backend? It seems easier to implement than the DPP method, and for that reason it might be easier for newbies to use.

Thank you,
Joe

I assume you are referring to DDP (DistributedDataParallel) instead of DPP (Data-Preprocessing Pipeline) in this post.

Is it possible to use DPP in a Jupyter notebook?

I wouldn't say it's impossible, but it could be tricky, as notebooks do not work well with multiprocessing.

Why do all the functions take "rank" as an argument, but never define this value anywhere?

That’s torch.multiprocessing.spawn’s default behavior. It will pass the rank automatically to the target function as the first argument. See the API doc below.

https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn
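
For example, here is a tiny sketch of that behavior (the worker name and the world size of 2 below are arbitrary placeholders):

```python
import torch.multiprocessing as mp

def worker(rank, world_size):
    # mp.spawn calls worker(i, *args) for i in 0..nprocs-1, so the rank arrives
    # automatically as the first argument; only world_size comes from args.
    print(f"running worker {rank} of {world_size}")

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```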

If you have your own boilerplate method for training a neural network in a loop, how do you incorporate these techniques? Do you only parallelize the model and let the computer determine where to send it, or do you need to parallelize everything as outlined in this tutorial?

DDP operates on the process level (see the minimal DDP example: Distributed Data Parallel — PyTorch 2.1 documentation). It does not care how you launch those processes, or where those processes are located. In your code, you just need to replace your local model with DDP(model, ...), and it will then take care of gradient synchronization for you.
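
As a rough sketch, assuming a process group has already been initialized (see the setup sketch further down) and that each process drives the GPU whose index matches its rank, the wrapping is the only DDP-specific change to an existing loop (the model and data here are placeholders):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(rank):
    model = nn.Linear(10, 1).to(rank)             # your existing model, moved to this rank's GPU
    ddp_model = DDP(model, device_ids=[rank])     # the only DDP-specific change
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(20, 10).to(rank)
    labels = torch.randn(20, 1).to(rank)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(ddp_model(inputs), labels)
    loss.backward()                               # gradients are synchronized across processes here
    optimizer.step()
```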

Regarding how to launch the processes, you can use any tool you prefer. For single-machine multi-GPU, mp.spawn() is sufficient. For multiple machines, many people use Slurm directly or use torchrun (torchrun (Elastic Launch) — PyTorch 2.1 documentation).

How would you “setup” your environment to access GPUs on the same computer? How do you know what port to use?

You just need to specify the master address and master port. All ranks will then talk to that master during rendezvous to discover each other (i.e., their IPs and ports). After that, all processes can talk to each other directly.
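
For all GPUs on one machine, the setup could look roughly like this (localhost and 29500 are just common placeholder choices; any reachable address and free port work):

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Every process agrees on where the master listens before init.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
```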

Also, how is the "init_method" file used on Windows computers? How would you pick one?

You can use a file on a shared filesystem (e.g., NFS) as the init_method. As long as all processes can access that file, it should work. I believe the TCP init method also works on Windows today. Let us know if you hit issues there. cc @H-Huang
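
A sketch of the file-based init on Windows, assuming a placeholder path that every process can read and write:

```python
import torch.distributed as dist

def setup(rank, world_size):
    # The file is used only for rendezvous; pick any path all processes can access.
    dist.init_process_group(
        backend="gloo",
        init_method="file:///C:/temp/ddp_init_file",  # hypothetical shared file path
        rank=rank,
        world_size=world_size,
    )
```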

Even though PyTorch does not recommend using data parallel over DPP, is there a way to use it with a gloo backend?

DataParallel (DP) is single-process multi-thread, and is slower than DDP. If you are using single-machine multi-GPU and are OK with DP’s perf, yep, DP is indeed an easier entry point. It does not need Gloo or NCCL ProcessGroup at all, as there is only one process and it can directly access all Tensors.
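
For comparison, a minimal DP sketch (single process, no process group or backend involved; the model and batch are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicates the model across all visible GPUs
model = model.to("cuda")

inputs = torch.randn(32, 10).to("cuda")
outputs = model(inputs)              # each batch is split across GPUs and results are gathered back
```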
