The documentation for torchrun says about the parameter --nnodes=1:4 that it allows training to run on one, two, three, or four PCs, each of which has $NUM_TRAINERS GPUs.
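For reference, the elastic launch form described in the docs looks roughly like this ($JOB_ID, $HOST_NODE_ADDR, and the script name are placeholders):

    # min 1 node, max 4 nodes; $NUM_TRAINERS processes (one per GPU) on each node
    torchrun \
        --nnodes=1:4 \
        --nproc-per-node=$NUM_TRAINERS \
        --max-restarts=3 \
        --rdzv-id=$JOB_ID \
        --rdzv-backend=c10d \
        --rdzv-endpoint=$HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py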
Is it possible to add trainers to, or remove them from, a group on a single node?
My workstation has two GPUs, and I am using a GPU-capable fork of task-spooler as a job queue. I'd like to use one GPU for development during working days and reuse it for processing overnight.
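To illustrate, this is a minimal sketch of the workflow as separate fixed-size single-node jobs (train.py is a placeholder; GPU selection via CUDA_VISIBLE_DEVICES is just one way to pin devices), rather than resizing one running job, which is what I am asking about:

    # Daytime: queue a single-GPU job on GPU 1, keeping GPU 0 free for development
    CUDA_VISIBLE_DEVICES=1 torchrun --standalone --nnodes=1 --nproc-per-node=1 train.py

    # Overnight: run a job across both GPUs
    torchrun --standalone --nnodes=1 --nproc-per-node=2 train.py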