DataLoader batch_size in DDP for multi-GPU and multi-node

Hi,

I want the total (effective) batch size to be 128.
When I train on 8 GPUs (2 nodes with 4 GPUs per node), so world_size=8 and ngpus_per_node=4,
what value should I pass as batch_size= to the DataLoader?
(I launch with torchrun and use DDP.)

128 / world_size = 16, 128 / ngpus_per_node = 32, or 128?
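
In case it helps clarify what I'm asking, here is a minimal sketch of my setup, with a dummy TensorDataset standing in for my real data (the per-process batch_size marked below is the value in question):

```python
import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()           # 8 in my case
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Dummy dataset standing in for my actual data.
dataset = TensorDataset(torch.randn(10_000, 32),
                        torch.randint(0, 10, (10_000,)))

# DistributedSampler shards the dataset across all world_size processes,
# so each process iterates over its own 1/world_size slice per epoch.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)

# The value in question:
# 128 // world_size (= 16), 128 // ngpus_per_node (= 32), or just 128?
batch_size = 128 // world_size  # one of the three candidates
loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```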

Thank you!