Back-propagation through time and distributed training

I have a Sampler that does backpropagation through time, similar to the one in the torchnlp examples. It requires passing the training batch size as an argument so that batches are properly continuous (e.g. if the dataset is [abcdefghi] and the batch size is 3, the batches are [adg] [beh] [cfi]). A minimal sketch of such a batch sampler is shown below.
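This is only a hypothetical sketch of the idea, not the torchnlp implementation: the dataset is cut into `batch_size` contiguous chunks, and position i of consecutive batches walks through chunk i.

```python
from torch.utils.data import Sampler


class BPTTBatchSampler(Sampler):
    """Hypothetical sketch of a BPTT-style batch sampler: each yielded batch
    takes the next index from every contiguous chunk, so batches stay
    continuous across iterations."""

    def __init__(self, data_source, batch_size):
        self.batch_size = batch_size
        # Each of the `batch_size` parallel streams owns one contiguous chunk.
        self.chunk_len = len(data_source) // batch_size

    def __iter__(self):
        # For dataset [abcdefghi] with batch_size 3 the chunks are
        # [abc] [def] [ghi], so the batches come out as [adg] [beh] [cfi].
        for step in range(self.chunk_len):
            yield [chunk * self.chunk_len + step for chunk in range(self.batch_size)]

    def __len__(self):
        return self.chunk_len
```

It would be used as a `batch_sampler`, e.g. `DataLoader(dataset, batch_sampler=BPTTBatchSampler(dataset, batch_size=3))`.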

My question is, how does this work in a distributed setting? From reading the code of the distributed sampler, I got the impression that each process gets its own copy of the sampler. In that case, I would need to know the local batch size in every process to split the batches correctly. Is there a general rule for determining the batch size per process (such as an equal portion per GPU?), and if not, how could one determine the local batch size?

Is there a general rule to determine batch size per process?

Yes, if possible it is better to distribute the data evenly across processes; otherwise the processes with lighter workloads will frequently have to wait for the stragglers, causing unnecessary slowdown.
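As a rough illustration of what an even split looks like (ignoring shuffling), this is approximately what torch.utils.data.DistributedSampler does when drop_last=False: pad the index list so it divides evenly, then give each rank a strided slice of the same length.

```python
import math


def shard_indices(dataset_len: int, rank: int, world_size: int):
    """Sketch of an even per-rank split, similar in spirit to
    DistributedSampler: every rank receives the same number of indices."""
    num_samples = math.ceil(dataset_len / world_size)
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    indices += indices[: total_size - dataset_len]   # pad with repeated indices
    return indices[rank:total_size:world_size]        # equal share per rank
```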

More importantly, if you are using DistributedDataParallel (DDP), all processes must run the same number of forward/backward iterations; otherwise the collective communications in the DDP backward pass will hang.
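One simple way to guard against uneven batch counts is to have all ranks agree on the minimum count before the training loop starts. This is a hypothetical helper, not a DDP API:

```python
import torch
import torch.distributed as dist


def equalize_num_batches(num_local_batches: int, device) -> int:
    """Hypothetical helper: all ranks agree on the smallest per-rank batch
    count, so every process performs the same number of forward/backward
    iterations and DDP's collectives never hang."""
    # With the NCCL backend the tensor must live on the current GPU.
    count = torch.tensor([num_local_batches], device=device)
    dist.all_reduce(count, op=dist.ReduceOp.MIN)
    return int(count.item())
```

Each rank would then truncate its training loop to the returned count.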

how could one determine the local batch size?

See discussion here: Should we split batch_size according to ngpu_per_node when DistributedDataparallel
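As a hedged summary of the usual convention (the value 64 below is just an example): with DDP each process builds its own DataLoader, and the batch size that DataLoader receives is already the per-process (local) batch size, so a fixed global batch is typically divided evenly by the world size.

```python
import torch.distributed as dist

# Common convention (not a hard rule): fix a global batch size and split it
# evenly across processes, one process per GPU.
global_batch_size = 64                                 # hypothetical value
world_size = dist.get_world_size()                     # e.g. 8 processes
assert global_batch_size % world_size == 0, "choose a divisible global batch size"
local_batch_size = global_batch_size // world_size     # 8 samples per process here
# This local value is what the per-process DataLoader (and a BPTT-style
# sampler) would receive as its batch_size.
```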