I am attempting to follow this tutorial to train a neural network over multiple GPUs on the same Windows machine (using gloo), and I don't fully understand why the example code is structured the way it is. Does anyone know of any other good tutorials on this? I'm trying to use all of the GPUs on my machine and don't fully understand how to write a full training and testing loop using DDP.
I’ve read several tutorials, but still have a few questions:
Is it possible to use DDP in a Jupyter notebook?
Why do all the functions take “rank” as an argument when this value is never defined anywhere? (See the sketch below for my current guess.)
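For what it's worth, here is my current guess (please correct me if it's wrong): `mp.spawn` calls the worker function once per process and injects the rank as its first argument, so rank is “defined” by the launcher rather than anywhere in the tutorial code. The worker body here is a placeholder of mine, not code from the tutorial:

```python
# My guess at where "rank" comes from -- a sketch, not the tutorial's code.
import torch.multiprocessing as mp

def worker(rank, world_size):
    # mp.spawn calls worker(i, world_size) for i = 0 .. nprocs-1,
    # so each process receives its own rank automatically.
    print(f"I am rank {rank} of {world_size}")

if __name__ == "__main__":
    # On Windows, spawned child processes re-import this module, so the
    # spawn call must sit behind this guard -- which I suspect is also
    # why launching from a notebook cell is awkward.
    world_size = 2  # placeholder: number of GPUs/processes
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```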
If you have your own boilerplate method for training a neural network in a loop, how do you incorporate these techniques? Do you only parallelize the model and let the computer determine where to send it, or do you need to parallelize everything as outlined in this tutorial? (I've sketched my current guess below.)
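For concreteness, this is how I currently imagine grafting DDP onto an existing loop: wrap the model in `DistributedDataParallel`, shard the data with `DistributedSampler`, and leave the rest of the loop alone. `MyModel` and `my_dataset` are placeholders for my own code, and I'm assuming `MASTER_ADDR`/`MASTER_PORT` have already been set as in my next question:

```python
# Sketch of my own loop adapted for DDP -- MyModel / my_dataset are
# placeholders, and MASTER_ADDR / MASTER_PORT are assumed to be set.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(rank, world_size, my_dataset, MyModel):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # one process per GPU: pin this process to its own device
    device = torch.device(f"cuda:{rank}")
    model = DDP(MyModel().to(device), device_ids=[rank])

    # the sampler hands each rank a disjoint shard of the dataset
    sampler = DistributedSampler(my_dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(my_dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)  # so shuffling differs across epochs
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()  # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()
```

Is that the right shape, or does more of the loop need to change?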
How would you set up your environment to access the GPUs on the same computer? How do you know which port to use? Also, how is the “init_method” file used on Windows machines, and how would you pick one?
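What I've pieced together so far (and would love to have checked) is that there are two rendezvous styles for a single machine: environment variables with an arbitrary free port, or a `file://` init_method, which I gather is the common choice on Windows. The port number and file path below are placeholders I made up, not values I know to be required:

```python
# Two setup variants I've seen in tutorials; the port and file path are
# arbitrary placeholders, not required values.
import os
import torch.distributed as dist

def setup_env(rank, world_size):
    # Variant 1: env:// rendezvous. "localhost" works because every
    # process is on the same machine; the port just needs to be free
    # and identical for all ranks.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"  # any unused port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def setup_file(rank, world_size):
    # Variant 2: file:// rendezvous. All processes coordinate through a
    # file they can read and write; it should not exist before the run.
    dist.init_process_group(
        "gloo",
        init_method="file:///C:/tmp/ddp_init_file",  # made-up path
        rank=rank,
        world_size=world_size,
    )
```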
Even though PyTorch recommends DDP over DataParallel, is there a way to use DataParallel with a gloo backend? It seems easier to implement than the DDP approach, and for that reason it might be easier for newbies to use.
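For comparison, this is the one-liner that makes `DataParallel` look so much simpler to me. If I understand correctly, it runs in a single process, so no backend (gloo or otherwise) is involved at all, but I'd appreciate confirmation:

```python
# Sketch of the DataParallel alternative as I understand it: a single
# process, with no init_process_group or backend setup required.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for a real model

if torch.cuda.device_count() > 1:
    # replicates the model onto each GPU and splits every batch among them
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(64, 128).cuda()
out = model(x)  # batch scattered across GPUs, outputs gathered on GPU 0
```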
Thank you,
Joe