The documentation for torchrun says about the parameter --nnodes=1:4 that it allows training to run on one, two, three, or four PCs, each of which has $NUM_TRAINERS GPUs.
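For reference, the elastic launch form described in the docs looks roughly like this ($JOB_ID, $HOST_NODE_ADDR, and the script name are placeholders):

    # min 1 node, max 4 nodes; $NUM_TRAINERS processes (one per GPU) on each node
    torchrun \
        --nnodes=1:4 \
        --nproc-per-node=$NUM_TRAINERS \
        --max-restarts=3 \
        --rdzv-id=$JOB_ID \
        --rdzv-backend=c10d \
        --rdzv-endpoint=$HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py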
Is it possible to add trainers to, or remove them from, a group on a single node?
My workstation has two GPUs, and I am using a GPU-capable fork of task-spooler as a job queue. I'd like to use one GPU for development during working days and reuse it for processing overnight.
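To illustrate, this is a minimal sketch of the workflow as separate fixed-size single-node jobs (train.py is a placeholder; GPU selection via CUDA_VISIBLE_DEVICES is just one way to pin devices), rather than resizing one running job, which is what I am asking about:

    # Daytime: queue a single-GPU job on GPU 1, keeping GPU 0 free for development
    CUDA_VISIBLE_DEVICES=1 torchrun --standalone --nnodes=1 --nproc-per-node=1 train.py

    # Overnight: run a job across both GPUs
    torchrun --standalone --nnodes=1 --nproc-per-node=2 train.py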