Replicating training with a single GPU and adjusting batch size

I am attempting to replicate (not reproduce, so I don’t need identical seeding) the training of the vision models provided in vision.references.classification. According to the README, these models were trained in a distributed way across 8 GPUs (i.e. --nproc_per_node=8), while I would like to train with identical hyperparameters on a single GPU (or at least fewer GPUs).

Therefore, I would like to change the command from e.g.
torchrun --nproc_per_node=8 train.py --batch_size 32 --model resnet18
to
torchrun --nproc_per_node=1 train.py --batch_size 32 --model resnet18

However, it is not clear to me whether this is equivalent, since it depends on whether optimizer.step() acts on each per-process batch independently or on gradients combined across all processes. If the gradients are averaged across the 8 processes before the step (which I believe is the case), the effective batch size would change from 32 * 8 to 32 * 1, thereby affecting the implicit bias of stochastic gradient descent.
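
For context, my current (possibly wrong) mental model is that under DistributedDataParallel each of the 8 processes runs a step roughly like the sketch below on its own 32 samples, with the gradients averaged across processes during backward():

def train_step(ddp_model, optimizer, criterion, images, targets):
    # ddp_model: a model wrapped in torch.nn.parallel.DistributedDataParallel
    # images, targets: the 32 samples local to this one process
    outputs = ddp_model(images)
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()   # DDP averages the gradients across all 8 processes here
    optimizer.step()  # so this update effectively uses 32 * 8 = 256 samples
    return loss.item()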

Therefore I suspect that matching these experiments requires multiplying the batch size by 8 and instead running:

torchrun --nproc_per_node=1 train.py --batch_size 256 --model resnet18

Is this correct?

As a follow-up question, suppose that this increased batch size requires more memory than my machine has. The solution I have come up with in this case is to apply optimizer.step() only every 8th batch (i.e. gradient accumulation). Specifically, I would run:

torchrun --nproc_per_node=1 train.py --batch_size 32 --model resnet18

but I would additionally change the training logic in the source of vision.references.classification.train.py (L46) from:

optimizer.step()

to:

if (i + 1) % 8 == 0:
    optimizer.step()
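
Put together, the accumulation variant I have in mind would look roughly like the sketch below (simplified, not the exact code from train.py; I have also assumed that optimizer.zero_grad() must be moved so it only runs after the delayed step, and that the loss should be divided by 8 so the accumulated gradient matches an average over 256 samples rather than a sum — please correct me if either assumption is wrong):

accumulation_steps = 8  # stand-in for the 8 GPUs I am no longer using

for i, (image, target) in enumerate(data_loader):
    output = model(image)
    # divide so the accumulated gradient matches an average over 8 * 32 samples
    loss = criterion(output, target) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # only clear after the delayed step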

Again, is this equivalent to the original hyperparameters? And is this the best solution?