Insufficient shared memory when using the NCCL backend for distributed training


I’m curious: does the NCCL backend for distributed training require additional shared memory?

When training ImageNet on 1 machine with 4 GPUs and 24 CPU cores, setting the batch size to 256 and the number of workers to 20 works fine. However, when I train ImageNet on 2 machines, each with 4 GPUs and 24 CPU cores, setting the batch size to 512 and the number of workers to 20 causes an insufficient shared memory error after a few epochs of training.
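For context, here is a back-of-envelope sketch of why shared memory pressure grows with this setup. It assumes one DDP process per GPU (so each process spawns its own DataLoader workers), that worker-to-main-process tensor transfers go through shared memory (`/dev/shm`), and standard 3×224×224 float32 ImageNet crops; all the numbers are taken from the question or are illustrative assumptions, not measurements.

```python
# Rough estimate of shared-memory demand per machine (assumed setup:
# one DDP process per GPU, DataLoader workers per process as below).
gpus_per_machine = 4
workers_per_process = 20                # num_workers in each DataLoader
total_workers = gpus_per_machine * workers_per_process
print(total_workers)                    # 80 worker processes per machine

# Assumed per-sample size: 3x224x224 float32 ImageNet crop.
bytes_per_sample = 3 * 224 * 224 * 4
per_gpu_batch = 512 // (2 * gpus_per_machine)   # global batch 512 over 8 GPUs
batch_bytes = per_gpu_batch * bytes_per_sample
print(batch_bytes)                      # 38535168 bytes (~37 MiB) per batch
```

With 80 workers each prefetching batches of this size through `/dev/shm`, the default shared memory limit inside a container (often 64 MB) or a small `/dev/shm` mount can be exhausted, which may be what happens here independently of NCCL itself.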

Any idea what the problem might be, and any guidance on how to set the number of workers for distributed training?