When doing `data_loader = DataLoader(my_dataset, sampler=DistributedSampler(my_dataset), batch_size=N)` in a DDP distributed training script, how many records does each GPU (worker? process? I'm not sure what the most accepted name is) receive at each iteration?
Does each DDP GPU receive `N` records, or `N / gpu_count`?
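For concreteness, here is a minimal, self-contained sketch of the setup I mean. The dataset contents, `N = 32`, and the `torchrun` launch command are placeholders I made up just to make the question concrete:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Launched with e.g. `torchrun --nproc_per_node=4 script.py`,
# so each process gets its rank/world_size from the environment.
dist.init_process_group(backend="nccl")

# Placeholder dataset: 1000 records of 8 features each.
my_dataset = TensorDataset(torch.randn(1000, 8))

N = 32  # per-loader batch_size -- the value my question is about

# DistributedSampler partitions the dataset indices across ranks.
sampler = DistributedSampler(my_dataset)
data_loader = DataLoader(my_dataset, sampler=sampler, batch_size=N)

for (batch,) in data_loader:
    # The question: is batch.shape[0] == N on every rank here,
    # or N / world_size?
    print(dist.get_rank(), batch.shape)
    break

dist.destroy_process_group()
```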