Configuring nodes - DDP/RPC for Data Parallel Training

I’m trying to reuse the servers at my university for Data Parallel Training (they’re Hadoop nodes with no GPUs, but the CPUs and memory are capable). I referred to PyTorch Distributed Overview — PyTorch Tutorials 1.10.0+cu102 documentation, which seems to be super high level; I can barely get anything out of it. Should I use DDP/RPC? Any ideas on how/where to get started?

I went through the example in examples/README.md at master · pytorch/examples · GitHub; it’s still not very clear to me!

I understand that I should use gloo as the backend, but how do I configure the slave nodes?
I understand that the process with rank 0 is the master, but where do I provide the slave IPs? How do I get the slave(s) to listen to the master node for new jobs? How do I start the slave processes?

All of them are CentOS 7 servers with 12 cores and 64 GB of RAM per node; I have about 12 of these.

Thanks in advance!

If you want to do Data Parallel Training, you should use DDP and this tutorial should help you: Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.1.1+cu121 documentation.

I understand that I should use gloo as the backend, but how do I configure the slave nodes?
I understand that the process with rank 0 is the master, but where do I provide the slave IPs? How do I get the slave(s) to listen to the master node for new jobs? How do I start the slave processes?

I think you probably need to familiarize yourself with ProcessGroups and our ProcessGroup API before looking into DDP. This tutorial would give a good overview: Writing Distributed Applications with PyTorch — PyTorch Tutorials 2.1.1+cu121 documentation

Essentially, to run DDP you first need to spawn N processes, and these processes have ranks [0, N). For all of these processes to talk to each other, you then need to initialize a ProcessGroup (think of it as a communication channel you are setting up across all of the processes) using init_process_group; see the sketch below.
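To make that concrete, here is a minimal per-node launcher sketch, assuming one process per node across your 12 machines; PROCS_PER_NODE, NODE_RANK, WORLD_SIZE, and run_worker are hypothetical names for this example, not part of the PyTorch API:

    import torch.multiprocessing as mp

    PROCS_PER_NODE = 1            # assumed: one worker process per machine
    NODE_RANK = 0                 # set to 0 on the master node, 1..11 on the other nodes
    WORLD_SIZE = 12 * PROCS_PER_NODE

    def run_worker(local_rank):
        global_rank = NODE_RANK * PROCS_PER_NODE + local_rank
        # init_process_group(...) goes here; see the next snippet
        pass

    if __name__ == "__main__":
        mp.spawn(run_worker, nprocs=PROCS_PER_NODE)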

Regarding initialization, you can refer to the TCP initialization section of our docs: Distributed communication package - torch.distributed — PyTorch 2.1 documentation. You only need to specify the master's IP (and a free port) on all ranks/processes. Every process that is not rank 0 will connect to that address and discover all of the other peers.
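Filling in the run_worker body from the sketch above, a hedged example of the call each rank would make, assuming the master node's IP is 10.1.1.20 and port 23456 is free (both are placeholders to replace with your own values):

    import torch.distributed as dist

    def run_worker(local_rank):
        global_rank = NODE_RANK * PROCS_PER_NODE + local_rank
        # Every rank uses the same master address; rank 0 listens there, the others connect to it.
        dist.init_process_group(
            backend="gloo",                        # CPU-friendly backend for these machines
            init_method="tcp://10.1.1.20:23456",   # placeholder: master node's IP and a free port
            rank=global_rank,
            world_size=WORLD_SIZE,
        )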

So essentially you start N processes and specify the same master address on all of them; the processes then discover each other via that address. From that point on they form a ProcessGroup, and you can run whatever collective operations you wish across the entire group.
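As a quick sanity check that the group is up, you can run a simple collective such as all_reduce on every rank (a generic example, not specific to DDP):

    import torch
    import torch.distributed as dist

    # Run on every rank after init_process_group has returned.
    t = torch.ones(1) * dist.get_rank()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    # t now holds 0 + 1 + ... + (world_size - 1) on every rank.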


Thank you for your input, that was really helpful!
I got the nodes talking to each other.

As I’ve mentioned, I don’t have GPUs installed on these machines. In the ToyModel example, how do I alter these lines to keep the model and processes on the CPUs of the respective nodes?

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # create model and move it to GPU (need to move to CPU) with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

It keeps throwing an error that the “NVIDIA driver is missing”; I do not intend to use CUDA.

You can just do this:

    # Keep everything on the CPU: no .to(rank) and no device_ids, so DDP uses the CPU parameters with the gloo backend.
    model = ToyModel()
    ddp_model = DDP(model)
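With no device_ids and a gloo process group, DDP keeps the model and gradients on the CPU, so the rest of the tutorial's training step works unchanged; roughly (the loss, optimizer, and random inputs below are just the tutorial's illustrative choices):

    import torch
    import torch.nn as nn
    import torch.optim as optim

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))         # ToyModel takes 10 input features
    loss_fn(outputs, torch.randn(20, 5)).backward()  # labels have 5 features
    optimizer.step()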

Thank you, Pritam.
That worked; marking it as the solution now.