I’m trying to reuse the servers at my university for data-parallel training (they’re Hadoop nodes with no GPUs, but the CPUs and memory are capable). I read the PyTorch Distributed Overview — PyTorch Tutorials 1.10.0+cu102 documentation, but it is very high level and I could barely follow it. Should I use DDP or RPC? Any pointers on how and where to get started?
I also went through the example in examples/README.md at master · pytorch/examples · GitHub, but it’s still not very clear to me.
I understand that I should use gloo as the backend, but how do I configure the worker nodes?
I understand that the process with rank 0 is the master, but where do I provide the worker IPs? How do I get the workers to listen to the master node for new jobs, and how do I start the worker processes?
All of them are CentOS 7 servers with 12 cores and 64 GB of RAM per node; I have about 12 of them.
Thanks in advance!
If you want to do data-parallel training, you should use DDP, and this tutorial should help: Getting Started with Distributed Data Parallel — PyTorch Tutorials 2.1.1+cu121 documentation.
I think you probably need to familiarize yourself with ProcessGroups and our ProcessGroup API before looking into DDP. This tutorial gives a good overview: Writing Distributed Applications with PyTorch — PyTorch Tutorials 2.1.1+cu121 documentation
Essentially, to run DDP you first spawn N processes with ranks [0, N). For all of these processes to talk to each other, you then need to initialize a ProcessGroup (think of it as a communication channel you are setting up across all the processes) using init_process_group.
Regarding initialization, you can refer to this section of our docs (TCP initialization): Distributed communication package - torch.distributed — PyTorch 2.1 documentation. You only need to specify the IP of the master on all ranks/processes. Every process that is not rank 0 will connect to that IP and discover all of the other peers.
So essentially you start N processes and specify the same master address on all of them; the processes then discover each other via that address. From that point on they form a ProcessGroup, and you can run collective operations across the entire group.
Thank you for your inputs, that was really helpful!
Got the nodes talking to each other.
As I mentioned, I don’t have GPUs installed on these machines. In the ToyModel example, how do I alter these lines so the processes run on the CPUs of the respective nodes?
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# create model and move it to GPU (need to move to CPU) with id rank
model = ToyModel().to(rank)
ddp_model = DDP(model, device_ids=[rank])
It keeps throwing an error that the “NVIDIA driver is missing”; I do not intend to use CUDA.
You can just do this:
model = ToyModel()
ddp_model = DDP(model)
Thank you, Pritam.
That worked; marking it as the solution now.