I am attempting to replicate (not reproduce, so I don't need identical seeding) the training of the vision models provided in vision.references.classification. According to the README, these models were trained in a distributed fashion across 8 GPUs (i.e. --nproc_per_node=8), whereas I would like to train with identical hyperparameters on a single GPU (or at least fewer GPUs).
Therefore, I would like to change the command from e.g.
torchrun --nproc_per_node=8 train.py --batch_size 32 --model resnet18
to
torchrun --nproc_per_node=1 train.py --batch_size 32 --model resnet18
However, it is not clear to me whether this is equivalent, since it depends on whether each optimizer.step() sees only the gradients of one 32-sample batch or the gradients averaged across all 8 processes. If the latter (which I believe to be the case?), the effective batch size would change from 32 * 8 to 32 * 1, thereby affecting the implicit bias of stochastic gradient descent.
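To make my mental model concrete: with DistributedDataParallel, I believe each of the 8 workers computes gradients on its own 32-sample batch and those gradients are averaged across workers during backward(), so one optimizer.step() effectively uses 256 samples. Here is a quick single-process check of that arithmetic on a toy linear model (nothing from the references, just illustrating the averaging):

import torch

# Toy check (no actual DistributedDataParallel): for a mean-reduction loss,
# averaging the gradients of 8 disjoint 32-sample shards equals the gradient
# of the full 256-sample batch, which is what DDP's all-reduce produces
# before optimizer.step().
torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
data, target = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = torch.nn.MSELoss()

# Gradient over the full 256-sample batch on one worker.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Average of the 8 per-worker gradients over 32-sample shards.
shard_grads = []
for d, t in zip(data.chunk(8), target.chunk(8)):
    model.zero_grad()
    loss_fn(model(d), t).backward()
    shard_grads.append(model.weight.grad.clone())

print(torch.allclose(full_grad, torch.stack(shard_grads).mean(dim=0), atol=1e-6))  # True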
Therefore, I suspect the way to match these experiments is to multiply the per-GPU batch size by 8 and instead run:
torchrun --nproc_per_node=1 train.py --batch_size 256 --model resnet18
Is this correct?
As a follow-up question, suppose this increased batch size requires more memory than my machine has. The solution I have come up with in that case is gradient accumulation, i.e. only applying optimizer.step() every 8th batch. Specifically, I would run:
torchrun --nproc_per_node=1 train.py --batch_size 32 --model resnet18
but I would additionally change the training logic in vision.references.classification.train.py (L46) from:
optimizer.step()
to:
if (i + 1) % 8 == 0:
    optimizer.step()
Again, is this equivalent to the originally implemented hyperparameters? And is this the best solution?
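For completeness, the full accumulation loop I have in mind looks roughly like the sketch below. This is a self-contained toy version with made-up stand-ins for the model, criterion, optimizer and data loader rather than the actual objects built in train.py, and it assumes the loss also has to be divided by 8 and optimizer.zero_grad() moved so it only runs after each step (otherwise the gradients of 7 out of every 8 batches would be discarded):

import torch

# Stand-ins for the objects that train.py already builds; sizes are arbitrary.
model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(64)]

accum_steps = 8  # accumulate eight 32-sample batches per parameter update

optimizer.zero_grad()
for i, (image, target) in enumerate(data_loader):
    output = model(image)
    # Scale so the accumulated gradient matches the mean over the combined
    # 256 samples rather than the sum of eight per-batch means.
    loss = criterion(output, target) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()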