Docs here: torchrun (Elastic Launch) — PyTorch 2.0 documentation
In the PyTorch docs for torchrun, two options are listed for single-node multi-worker training: "Single-node multi-worker" and "Stacked single-node multi-worker".
For me, "Single-node multi-worker" did not work as intended, while "Stacked single-node multi-worker" worked exactly as expected: the former only ran on one GPU, whereas the "stacked" version of the command engaged all available GPUs. What are the intended differences and use cases of these two options?
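For reference, these are the two launch commands as I understand them from the torchrun docs (the script name `train.py` and `$NUM_TRAINERS` are placeholders; the exact flags may differ slightly between PyTorch versions):

```shell
# "Single-node multi-worker" — standalone rendezvous on a fixed default port
torchrun \
    --standalone \
    --nnodes=1 \
    --nproc-per-node=$NUM_TRAINERS \
    train.py

# "Stacked single-node multi-worker" — c10d rendezvous on a free ephemeral
# port (localhost:0), so multiple torchrun instances can share one node
torchrun \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    --nnodes=1 \
    --nproc-per-node=$NUM_TRAINERS \
    train.py
```

In both cases I would expect `--nproc-per-node=$NUM_TRAINERS` to spawn one worker per GPU, which is why the single-GPU behavior of the first command surprised me.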