Hello there, I am a student and, you could say, a beginner in machine learning.
I am currently looking into parallel training on multiple GPUs. I understand DataParallel, but I can't get DistributedDataParallel to work.
The part I don't understand is the communication through the backend and how to connect two nodes. For example, do the nodes need to be in the same cluster, or is a static IP for the master node enough? Can I somehow use a public IP, and if so, how?
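
To make the question concrete, here is a minimal sketch of the setup as I understand it, assuming the `env://` initialization method of `torch.distributed`. The helper names `rendezvous_env` and `init_distributed`, the address `192.0.2.10`, and the port `29500` are placeholders I made up for illustration, not values from a real setup. As far as I can tell, every process on every node has to be able to open a TCP connection to the master address and port.

```python
import os


def rendezvous_env(master_addr, master_port, rank, world_size):
    """Build the environment variables that init_method="env://" reads.

    Hypothetical helper: master_addr/master_port must be reachable
    from every node that joins the process group.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_distributed(master_addr, master_port, rank, world_size, backend="gloo"):
    """Join the process group; each process on each node calls this once."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    os.environ.update(rendezvous_env(master_addr, master_port, rank, world_size))
    # Blocks until all world_size processes have connected to the master.
    dist.init_process_group(backend=backend, init_method="env://")
    return dist


if __name__ == "__main__":
    # Placeholder values: the second of two processes, master at 192.0.2.10.
    print(rendezvous_env("192.0.2.10", 29500, rank=1, world_size=2))
```

In this sketch the process with rank 0 would run on the master node, and the other node would call `init_distributed` with the same address and port but rank 1. The backend defaults to `gloo` here only because it also works on CPU; for GPU training I assume `nccl` would be passed instead.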
My question, really, is whether you could point me to some good sources of knowledge, tutorials, etc.
I would be really grateful.