Distributed Data Parallel

Hello there, I am a student and, you could say, a beginner in machine learning.
I am currently looking into parallel training on multiple GPUs. I understand DataParallel, but I can't get Distributed Data Parallel to work.
The part I don't understand is communication through the backend and connecting two nodes. For example, do they need to be on the same cluster, or is a static IP for the master node enough? Can I somehow use a public IP, and if so, how?

My real question is whether you could point me to some good sources of knowledge, tutorials, etc.

I would be really grateful.


To answer your question about communication concretely: the nodes need some way to discover each other, either through a shared filesystem that every node can see or through a main IP address that every node can reach.
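
As a rough sketch of those two discovery options, the snippet below builds the `init_method` URL that `torch.distributed.init_process_group` accepts for each case. The helper names (`tcp_init_method`, `file_init_method`, `init_distributed`), the address `10.0.0.1`, the port `29500`, and the path in the usage note are placeholders chosen for illustration, not values from your setup.

```python
import os


def tcp_init_method(master_addr: str, master_port: int) -> str:
    """URL for TCP rendezvous: every process connects to rank 0 at this address."""
    return f"tcp://{master_addr}:{master_port}"


def file_init_method(shared_path: str) -> str:
    """URL for shared-filesystem rendezvous: every process must see the same file."""
    return f"file://{os.path.abspath(shared_path)}"


def init_distributed(init_method: str, rank: int, world_size: int,
                     backend: str = "gloo") -> None:
    """Join the process group; call once in every process with its own rank."""
    # Imported here so the URL helpers above work without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,          # "nccl" is the usual choice for GPU training
        init_method=init_method,  # how the processes find each other
        rank=rank,                # unique id of this process, 0 .. world_size - 1
        world_size=world_size,    # total number of processes across all nodes
    )


if __name__ == "__main__":
    # Hypothetical address and port of the node that runs rank 0.
    print(tcp_init_method("10.0.0.1", 29500))
```

With the TCP option, each process would call something like `init_distributed(tcp_init_method("10.0.0.1", 29500), rank, world_size)` with its own rank. As far as I know, the nodes do not have to be in the same cluster for this; what matters is that every node can open a connection to that address and port, so a static or public IP should work only if firewalls and routing allow the traffic. With the file option, all nodes would pass the same path on a filesystem they all mount, for example `file_init_method("/mnt/shared/ddp_rendezvous")`.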

Here are some helpful tutorials around PyTorch distributed and DDP:
PyTorch DDP tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
PyTorch distributed overview: https://pytorch.org/tutorials/beginner/dist_overview.html
