DataLoader num_workers in a multi-node distributed setup

I am following the setup used here: examples/imagenet at ee964a2eeb41e1712fe719b83645c79bcbd0ba1a · pytorch/examples · GitHub, and I am trying to run it on 2 nodes with 4 GPUs each.

When using DistributedDataParallel, the example calculates num_workers as follows:

workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)

Otherwise, it uses the number entered by the user, with 4 as the default.
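For reference, my understanding is that this expression is just ceiling division, splitting the user-specified worker count evenly across the per-GPU processes on a node (the function name below is mine, not from the example):

```python
def workers_per_process(total_workers: int, ngpus_per_node: int) -> int:
    # Equivalent to the example's int((w + n - 1) / n): ceiling division,
    # so each per-GPU process gets a share of the requested workers.
    return (total_workers + ngpus_per_node - 1) // ngpus_per_node

# The default of 4 workers on a node with 4 GPUs gives 1 worker per process.
print(workers_per_process(4, 4))  # -> 1
# With 3 GPUs the result rounds up, so the total can slightly exceed 4.
print(workers_per_process(4, 3))  # -> 2
```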

I am not sure why they use this calculation.
What are the guidelines for setting the number of workers, especially in a distributed setup?

Thank you