Should we split batch_size according to ngpu_per_node when using DistributedDataParallel?

Is it correct that when the local batch size is 64 (i.e. torch.utils.data.DataLoader(batch_size=64) together with torch.utils.data.distributed.DistributedSampler) and there are N processes in total in DDP (whether those N processes sit on one node or are spread across several), the forward-backward pass is effectively the same as a single-GPU, single-node forward-backward pass with a batch size of 64×N?
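
For concreteness, here is a minimal sketch of the setup I mean. The toy dataset, model, and training loop are just placeholders, and it assumes a torchrun-style launch that sets RANK, WORLD_SIZE, and LOCAL_RANK:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # Assumes launch via torchrun, which sets RANK / WORLD_SIZE / LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder dataset and model, just to make the question concrete.
    dataset = TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1))
    sampler = DistributedSampler(dataset)  # each of the N processes gets a disjoint shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # local batch size = 64 per process

    model = torch.nn.Linear(32, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # DDP all-reduces (averages) gradients across the N processes here
        optimizer.step()
        # Question: is this step equivalent to a single-GPU step with batch size 64 * N?

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```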