Stacked vs. non-stacked torchrun CLI options

Docs here: torchrun (Elastic Launch) — PyTorch 2.0 documentation

The PyTorch docs for torchrun list two options for single-node multi-worker training: “Single-node multi-worker” and “Stacked single-node multi-worker”.
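
For reference, here is a sketch of the two invocations roughly as that docs page presents them (`$NUM_TRAINERS` and `YOUR_TRAINING_SCRIPT.py` are placeholders; flag spellings may vary slightly between PyTorch versions):

```
# "Single-node multi-worker"
torchrun \
    --standalone \
    --nnodes=1 \
    --nproc-per-node=$NUM_TRAINERS \
    YOUR_TRAINING_SCRIPT.py

# "Stacked single-node multi-worker"
torchrun \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    --nnodes=1 \
    --nproc-per-node=$NUM_TRAINERS \
    YOUR_TRAINING_SCRIPT.py
```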

For me, the “Single-node multi-worker” command did not work as intended, while the “Stacked single-node multi-worker” command worked exactly as expected: the former only ran on one GPU, whereas the “stacked” version engaged all available GPUs. What are the intended differences and use cases of these two options?