Min/max in RENDEZVOUS and world_size

Trying to understand the connection between these two concepts, after reading this page: Rendezvous — PyTorch 1.9.0 documentation

Do we need this min/max if I already know the exact number of machines under my control?

Let us say I have two dedicated machines that I want to use for training. That means I want world_size=2. Then both min and max should be exactly 2 for the intended allocation, right? In what kind of scenario would one want to set min/max differently to take advantage of this flexibility?

If you have 2 machines you can use for training, then set max to 2. If you are willing to start training with just 1 machine, then setting min to 1 makes sense. In general, you set min to a different value than max when you are willing to start a training job with fewer machines, e.g. due to faults in the system.
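To make the effect of the two bounds concrete, here is a small sketch of the completion rule they imply (hypothetical helper, not the actual TorchElastic API):

```python
# Decide whether a rendezvous round can complete, given how many nodes
# have joined and the configured min/max bounds.
# Hypothetical helper for illustration, not the real TorchElastic API.
def can_complete(joined: int, min_nodes: int, max_nodes: int,
                 last_call_expired: bool) -> bool:
    if joined >= max_nodes:
        return True   # full gang: dispatch immediately
    if joined >= min_nodes and last_call_expired:
        return True   # enough nodes, and we waited long enough for more
    return False      # keep waiting

# With two dedicated machines, min = max = 2: training starts only
# once both machines have joined.
assert not can_complete(1, 2, 2, last_call_expired=True)
assert can_complete(2, 2, 2, last_call_expired=False)

# With min=1, max=2: one machine can start alone after the waiting
# period, e.g. if the second machine is down.
assert can_complete(1, 1, 2, last_call_expired=True)
```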

cc @cbalioglu

In traditional HPC and ML systems all nodes that are part of a job are started simultaneously (as you implied in your question) and the job gets dispatched (by a scheduler) once all nodes are ready to execute. The set of nodes that are part of a job are usually called a “gang”.

TorchElastic is designed to also handle systems that do not form a gang. Those systems are usually meant to run traditional distributed applications where the scheduler has no concept of forming a gang of nodes (e.g. the scheduler simply starts executing the job on a node the moment that node becomes available). Since you have no formal gang, you need some mechanism to simulate one (a "pseudo-gang") and dispatch the job at some point in time after the user requested it. This is where the minimum and maximum number of nodes come into play.

You basically tell TorchElastic that it should wait until at least min nodes become available to execute a job, but that ideally you want it to have max nodes for the execution. Once TorchElastic reaches min nodes, it sets an internal "last call" timer and continues accepting new nodes until the timer expires or max nodes is reached. At that point TorchElastic dispatches the job on all participant nodes. In summary, TorchElastic has a built-in capability to form a gang even when run on systems that have no such concept.

In your particular case (and in most cases) setting minimum and maximum number of nodes to the same value is the way to go. They become relevant in non-traditional execution environments as I described above.
